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Abstract 

We study an approach to policy selection for large relational Markov Decision Processes 
(MDPs). We consider a variant of approximate policy iteration (API) that replaces the 
usual value-function learning step with a learning step in policy space. This is advantageous 
in domains where good policies are easier to represent and learn than the corresponding 
value functions, which is often the case for the relational MDPs we are interested in. 
In order to apply API to such problems, we introduce a relational policy language and 
corresponding learner. In addition, we introduce a new bootstrapping routine for goal- 
based planning domains, based on random walks. Such bootstrapping is necessary for 
many large relational MDPs, where reward is extremely sparse, as API is ineffective in 
such domains when initialized with an uninformed policy. Our experiments show that the 
resulting system is able to find good policies for a number of classical planning domains 
and their stochastic variants by solving them as extremely large relational MDPs. The 
experiments also point to some limitations of our approach, suggesting future work. 



1. Introduction 

Many planning domains are most naturally represented in terms of objects and relations 
among them. Accordingly, AI researchers have long studied algorithms for planning and 
learning-to-plan in relational state and action spaces. These include, for example, "classical" 
STRIPS domains such as the blocks world and logistics. 

A common criticism of such domains and algorithms is the assumption of an idealized, 
deterministic world model. This, in part, has led AI researchers to study planning and 
learning within a decision-theoretic framework, which explicitly handles stochastic environ- 
ments and generalized reward-based objectives. However, most of this work is based on 
explicit or propositional state-space models, and so far has not demonstrated scalability to 
the large relational domains that are commonly addressed in classical planning. 

Intelligent agents must be able to simultaneously deal with both the complexity arising 
from relational structure and the complexity arising from uncertainty. The primary goal 
of this research is to move toward such agents by bridging the gap between classical and 
decision-theoretic techniques. 

In this paper, we describe a straightforward and practical method for solving very large, 
relational MDPs. Our work can be viewed as a form of relational reinforcement learning 
(RRL) where we assume a strong simulation model of the environment. That is, we assume 
access to a black-box simulator, for which we can provide any (relationally represented) 
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state/action pair and receive a sample from the appropriate next-state and reward distribu- 
tions. The goal is to interact with the simulator in order to learn a policy for achieving high 
expected reward. It is a separate challenge, not considered here, to combine our work with 
methods for learning the environment simulator to avoid dependence on being provided 
such a simulator. 

Dynamic-programming approaches to finding optimal control policies in MDPs (Bell- 
man, 1957; Howard, 1960), using explicit (flat) state space representations, break down 
when the state space becomes extremely large. More recent work extends these algorithms 
to use propositional (Boutilier & Dearden, 1996; Dean k, Givan, 1997; Dean, Givan, & 
Leach, 1997; Boutilier, Dearden, &; Goldszmidt, 2000; Givan, Dean, & Greig, 2003; Guestrin, 
Roller, Parr, & Venkataraman, 2003b) as well as relational (Boutilier, Reiter, h Price, 2001; 
Guestrin, Roller, Gearhart, & Ranodia, 2003a) state-space representations. These exten- 
sions have significantly expanded the set of approachable problems, but have not yet shown 
the capacity to solve large classical planning problems such as the benchmark problems 
used in planning competitions (Bacchus, 2001), let alone their stochastic variants. One pos- 
sible reason for this is that these methods are based on calculating and representing value 
functions. For familiar STRIPS planning domains (among others), useful value functions 
can be difficult to represent compactly, and their manipulation becomes a bottle-neck. 

Most of the above techniques are purely deductive — that is, each value function is guar- 
anteed to have a certain level of accuracy. Rather, in this work, we will focus on inductive 
techniques that make no such guarantees in practice. Most existing inductive forms of ap- 
proximate policy iteration (API) utilize machine learning to select compactly represented 
approximate value functions at each iteration of dynamic programming (Bertsekas & Tsit- 
siklis, 1996). As with any machine learning algorithm, the selection of the hypothesis space, 
here a space of value functions, is critical to performance. An example space used frequently 
is the space of linear combinations of a human-selected feature set. 

To our knowledge, there has been no previous work that applies any form of API to 
benchmark problems from classical planning, or their stochastic variants. 1 Again, one 
reason for this is the high complexity of typical value functions for these large relational 
domains, making it difficult to specify good value-function spaces that facilitate learning. 
Comparably, it is often much easier to compactly specify good policies, and accordingly 
good policy spaces for learning. This observation is the basis for recent work on induc- 
tive policy selection in relational planning domains, both deterministic (Rhardon, 1999a; 
Martin & Geffner, 2000), and probabilistic (Yoon, Fern, & Givan, 2002). These techniques 
show that useful policies can be learned using a policy-space bias described by a generic 
(relational) knowledge representation language. Here we incorporate those ideas into a vari- 
ant of API, that achieves significant success without representing or learning approximate 
value functions. Of course, a natural direction for future work is to combine policy-space 
techniques with value-function techniques, to leverage the advantages of both. 

Given an initial policy, our approach uses the simulation technique of policy rollout 
(Tesauro & Galperin, 1996) to generate trajectories of an improved policy. These trajecto- 
ries are then given to a classification learner, which searches for a classifier, or policy, that 
"matches" the trajectory data, resulting in an approximately improved policy. These two 

1. Recent work in relational reinforcement learning has been applied to STRIPS problems with much simpler 
goals than typical benchmark planning domains, and is discussed in Section 8. 
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steps are iterated until no further improvement is observed. The resulting algorithm can be 
viewed as a form of API where the iteration is carried out without inducing approximate 
value functions. 

By avoiding value function learning, this algorithm helps address the representational 
challenge of applying API to relational planning domains. However, another fundamental 
challenge is that, for non-trivial relational domains, API requires some form of bootstrap- 
ping. In particular, for most STRIPS planning domains the reward, which corresponds to 
achieving a goal condition, is sparsely distributed and unlikely to be reached by random ex- 
ploration. Thus, initializing API with a random or uninformed policy, will likely result in no 
reward signal and hence no guidance for policy improvement. One approach to bootstrap- 
ping is to rely on the user to provide a good initial policy or heuristic that gives guidance 
toward achieving reward. Rather, in this work we develop a new automatic bootstrapping 
approach for goal-based planning domains, which does not require user intervention. 

Our bootstrapping approach is based on the idea of random- walk problem distributions. 
For a given planning domain, such as the blocks world, this distribution randomly generates 
a problem (i.e., an initial state and a goal) by selecting a random initial state and then 
executing a sequence of n random actions, taking the goal condition to be a subset of 
properties from the resulting state. The problem difficulty typically increases with n, and 
for small n (short random walks) even random policies can uncover reward. Intuitively, a 
good policy for problems with walk length n can be used to bootstrap API for problems with 
slightly longer walk lengths. Our bootstrapping approach iterates this idea, by starting with 
a random policy and very small n, and then gradually increasing the walk length until we 
learn a policy for very long random walks. Such long-random-walk policies clearly capture 
much domain knowledge, and can be used in various ways. Here, we show that empirically 
such policies often perform well on problem distributions from relational domains used in 
recent deterministic and probabilistic planning competitions. 

An implementation of this bootstrapped API approach took second place of 3 competi- 
tors in the hand-tailored track of the 2004 International Probabilistic Planning Competi- 
tion. 2 To our knowledge this is the first machine-learning based system to be entered in 
any planning competition, either deterministic or probabilistic. 

Here, we give an evaluation of our system on a number of probabilistic and deterministic 
relational planning domains, including the AIPS-2000 competition benchmarks, and bench- 
marks from the hand-tailored track of the 2004 Probabilistic Planning Competition. The 
results show that the system is often able to learn policies in these domains that perform 
well for long-random-walk problems. In addition, these same policies often perform well on 
the planning-competition problem distributions, comparing favorably with the state-of-the- 
art planner FF in the deterministic domains. Our experiments also highlight a number of 
limitations of the current system, which point to interesting directions for future work. 

The remainder of paper proceeds as follows. In Section 2, we introduce our problem 
setup and then, in Section 3, present our new variant of API. In Section 4, we provide some 

2. Note, however, that this approach is not hand-tailored. Rather, given a domain definition, our system 
learns a policy offline, automatically, which can then be applied to any problem from the domain. We 
entered the hand-tailored track because it was the only track that facilitated the use of offline learning, 
by providing domains and problem generators before the competition. The other entrants were human- 
written for each domain. 
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technical analysis of the algorithm, giving performance bounds on the policy-improvement 
step. In Sections 5 and 6, we describe an implemented instantiation of our API approach 
for relational planning domains. This includes a description of a generic policy language 
for relational domains, a classification learner for that language, and a novel bootstrapping 
technique for goal-based domains. Section 7 presents our empirical results, and finally 
Sections 8 and 9 discuss related work and future directions. 

2. Problem Setup 

We formulate our work in the framework of Markov Decision Processes (MDPs). While 
our primary motivation is to develop algorithms for relational planning domains, we first 
describe our problem setup and approach for a general, action-simulator-based MDP repre- 
sentation. Later, in Section 5, we describe a particular representation of planning domains 
as relational MDPs and the corresponding relational instantiation of our approach. 

Following and adapting Kearns, Mansour, and Ng (2002) and Bertsekas and Tsitsiklis 
(1996), we represent an MDP using a generative model (S,A,T,R,I), where S is a finite 
set of states, A is a finite, ordered set of actions, and T is a randomized action-simulation 
algorithm that, given state s and action a, returns a next state s' according to some unknown 
probability distribution Pt(s'\s, a). The component R is a reward function that maps S x A 
to real-numbers, with R(s, a) representing the reward for taking action a in state s, and I 
is a randomized initial-state algorithm with no inputs that returns a state s according to 
some unknown distribution Po(s). We sometimes treat I and T(s,a) as random variables 
with distributions Pq(-) and Pr(-\s,a) respectively. 

For an MDP M = (S, A, T, R,I), a policy tt is a (possibly stochastic) mapping from S to 
A. The value function of tt, denoted V r (s), represents the expected, cumulative, discounted 
reward of following policy tt in M starting from state s, and is the unique solution to 

V"(s) = E[R(s, tt(s)) + iV*(T{a, tt{s)))] (1) 

where < 7 < 1 is the discount factor. The Q-value function Q 7T (s,a) represents the 
expected, cumulative, discounted reward of taking action a in state s and then following tt, 
and is given by 

Q>, a) = R(s, a) + 7E[V*(T(s, a))} (2) 

We will measure the quality of a policy by the objective function V{tt) = E[V W (I)], giving 
the expected value obtained by that policy when starting from a randomly drawn initial 
state. A common objective in MDP planning and reinforcement learning is to find an 
optimal policy tt* = argmax 7r F(7r). However, no automated technique, including the one 
we present here, has to date been able to guarantee finding an optimal policy in the relational 
planning domains we consider, in reasonable running time. 

It is a well known fact that given a current policy n, we can define a new improved 
policy 

Pr(s) = argmax aeA Q w (s, a) (3) 

such that the value function of VI 1 * is guaranteed to 1) be no worse than that of tt at each 
state s, and 2) strictly improve at some state when tt is not optimal. Policy iteration is an 
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algorithm for computing optimal policies by iterating policy improvement (VI) from any 
initial policy to reach a fixed point, which is guaranteed to be an optimal policy. Each 
iteration of policy improvement involves two steps: 1) Policy evaluation where we compute 
the value function V w of the current policy tt, and 2) Policy selection, where, given V* from 
step 1, we select the action that maximizes Q w (s, a) at each state, defining a new improved 
policy. 

Finite Horizons. Since our API variant is based on simulation, and must bound the 
simulation trajectories by a horizon h, our technical analysis in Section 4 will use the notion 
of finite-horizon discounted reward. The /i-horizon value function Vfi is recursively defined 
as 



giving the expected discounted reward obtained by following ir for h steps from s. We also 
define the /i-horizon Q-function Q^(s,a) = R(s,a) + jE[V^_ 1 (T(s, a))], and the /i-horizon 
objective function Vh(n) = E[Vfi(I)]. It is well known, that the effect of using a finite 
horizon can be made arbitrarily small. In particular, we have that for all states s and 
actions a, the approximation error decreases exponentially with h, 



where Rm&x is the maximum of the absolute value of the reward for any action at any state. 
From this we also get that \Vh{^) — V(?r)| < j h V max . 

3. Approximate Policy Iteration with a Policy Language Bias 

Exact solution techniques, such as policy iteration, are typically intractable for large state- 
space MDPs, such as those arising from relational planning domains. In this section, we 
introduce a new variant of approximate policy iteration (API) intended for such domains. 
First, we review a generic form of API used in prior work, based on learning approximate 
value functions. Next, motivated by the fact that value functions are often difficult to learn 
in relational domains, we describe our API variant, which avoids learning value functions 
and instead learns policies directly as state-action mappings. 

3.1 API with Approximate Value Functions 

API, as described in Bertsekas and Tsitsiklis (1996), uses a combination of Monte-Carlo 
simulation and inductive machine learning to heuristically approximate policy iteration in 
large state-space MDPs. Given a current policy tt, each iteration of API approximates 
policy evaluation and policy selection, resulting in an approximately improved policy tt. 
First, the policy-evaluation step constructs a training set of samples of V w from a small but 
representative set of states. Each sample is computed using simulation, estimating V n (s) 
for the policy tt at each state s by drawing some number of sample trajectories of tt starting 



Vf(s) = 0, V£(s) = E[R(s,tt(s)) + 7 Vfc_ 1 (T(s, 7 r( S )))] 



(4) 




max — 



and 
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at s and then averaging the cumulative, discounted reward along those trajectories. Next, 
the policy-selection step uses a function approximator (e.g., a neural network) to learn an 
approximation V n to V w based on the training data. V* then serves as a representation 
for 7r, which selects actions using sampled one-step lookahead based on V n , that is 

tt(s) = aigm&x R(s, a) + jE{V 7T (T(s, a))}. 

A common variant of this procedure learns an approximation of Q 71 rather than V 71 . 

API exploits the function approximator's generalization ability to avoid evaluating each 
state in the state space, instead only directly evaluating a small number of training states. 
Thus, the use of API assumes that states and perhaps actions are represented in a factored 
form (typically, a feature vector) that facilitates generalizing properties of the training data 
to the entire state and action spaces. Note that in the case of perfect generalization (i.e., 
V 7T (s) = V n (s) for all states s), we have that tt is equal to the exact policy improvement 
VI*, and thus API simulates exact policy iteration. However, in practice, generalization is 
not perfect, and there are typically no guarantees for policy improvement 3 — nevertheless, 
API often "converges" usefully (Tesauro, 1992; Tsitsiklis & Van Roy, 1996). 

The success of the above API procedure depends critically on the ability to represent 
and learn good value-function approximations. For some MDPs, such as those arising from 
relational planning domains, it is often difficult to specify a space of value functions and 
learning mechanism that facilitate good generalization. For example, work in relational 
reinforcement learning (Dzeroski, DeRaedt, & Driessens, 2001) has shown that learning 
approximate value functions for classical domains, such as the blocks world, can be prob- 
lematic. 4 In spite of this, it is often relatively easy to compactly specify good policies using 
a language for (relational) state-action mappings. This suggests that such languages may 
provide useful policy-space biases for learning in API. However, all prior API methods are 
based on approximating value functions and hence can not leverage these biases. With 
this motivation, we consider a form of API that directly learns policies without directly 
representing or approximating value functions. 

3.2 Using a Policy Language Bias 

A policy is simply a classifier, possibly stochastic, that maps states to actions. Our API 
approach is based on this view, and is motived by recent work that casts policy selection 
as a standard classification learning problem. In particular, given the ability to observe 
trajectories of a target policy, we can use machine learning to select a policy, or classifier, 
that mimics the target as closely as possible. Khardon (1999b) studied this learning setting 
and provided PAC-like learnability results, showing that under certain assumptions, a small 
number of trajectories is sufficient to learn a policy whose value is close to that of the 
target. In addition, recent empirical work, in relational planning domains (Khardon, 1999a; 
Martin & Geffner, 2000; Yoon et al., 2002), has shown that by using expressive languages 

3. Under very strong assumptions, API can be shown to converge in the infinite limit to a near-optimal 
value function. See Proposition 6.2 of Bertsekas and Tsitsiklis (1996). 

4. In particular, the RRL work has considered a variety of value-function representation including relational 
regression trees, instance based methods, and graph kernels, but none of them have generalized well over 
varying numbers of objects. 
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for specifying state-action mappings, good policies can be learned from sample trajectories 
of good policies. 

These results suggest that, given a policy tt, if we can somehow generate trajectories of 
an improved policy, then we can learn an approximately improved policy based on those 
trajectories. This idea is the basis of our approach. Figure 1 gives pseudo-code for our API 
variant, which starts with an initial policy ttq and produces a sequence of approximately 
improved policies. Each iteration involves two primary steps: First, given the current 
policy tt, the procedure Improved- Trajectories (approximately) generates trajectories of 
the improved policy tt' = VI n . Second, these trajectories are used as training data for the 
procedure Learn-Policy, which returns an approximation of tt'. We now describe each 
step in more detail. 

Step 1: Generating Improved Trajectories. Given a base policy tt, the simula- 
tion technique of policy rollout (Tesauro & Galperin, 1996; Bertsekas & Tsitsiklis, 1996) 
computes an approximation tt to the improved policy tt' = VI W , where tt' is the result of 
applying one step of policy iteration to tt. Furthermore, for a given state s, policy rollout 
computes fr(s) without the need to solve for tt' at all other states, and thus provides a 
tractable way to approximately simulate the improved policy tt' in large state-space MDPs. 
Often tt' is significantly better than tt, and hence so is tt, which can lead to substantially 
improved performance at a small cost. Policy rollout has provided significant benefits in a 
number of application domains, including for example Backgammon (Tesauro & Galperin, 
1996), instruction scheduling (McGovern, Moss, & Barto, 2002), network-congestion control 
(Wu, Chong, & Givan, 2001), and Solitaire (Yan, Diaconis, Rusmevichientong, & Van Roy, 
2004). 

Policy rollout computes vr(s), the estimate of tt'(s), by estimating Q 77 (s,a) for each 
action a and then taking the maximizing action to be fr(s) as suggested by Equation 3. 
Each Q w (s,a) is estimated by drawing w trajectories of length h, where each trajectory is 
the result of starting at s, taking action a, and then following the actions selected by tt for 
h — 1 steps. The estimate of Q n (s,a) is then taken to be the average of the cumulative 
discounted reward along each trajectory. The sampling width w and horizon h are specified 
by the user, and control the trade off between increased computation time for large values, 
and reduced accuracy for small values. Note that rollout applies to both stochastic and 
deterministic policies and that due to variance in the Q-value estimates, the rollout policy 
can be stochastic even for deterministic base policies. 

The procedure Improved- Trajectories uses rollout to generate n length h trajectories 
of tt, each beginning at a randomly drawn initial state. Rather than just recording the 
states and actions along each trajectory, we store additional information that is used by 
our policy-learning algorithm. In particular, the i'th element of a trajectory has the form 
(sj, ir(si), Q(si, oli), . . . ,Q(si,a m )), giving the i'th state Si along the trajectory, the action 
selected by the current (unimproved) policy at Sj, and the Q-value estimates Q(si,a) for 
each action. Note that given the Q-value information for Si the learning algorithm can 
determine the approximately improved action tt(s), by maximizing over actions, if desired. 

Step 2: Learn Policy. Intuitively, we want Learn-Policy to select a new policy 
that closely matches the training trajectories. In our experiments, we use relatively simple 
learning algorithms based on greedy search within a space of policies specified by a policy- 
language bias. In Sections 5.2 and 5.3 we detail the policy-language learning bias used 
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by our technique, and the associated learning algorithm. In Section 4 we provide some 
technical analysis of an idealized version of this algorithm, providing guidance regarding 
the required number of training trajectories. We note that by labeling each training state 
in the trajectories with the associated Q-values for each action, rather than simply with the 
best action, we enable the learner to make more informed trade-offs, focusing on accuracy 
at states where wrong decisions have high costs, which was empirically useful. Also, the 
inclusion of ir(s) in the training data enables the learner to adjust the data relative to tt, 
if desired — e.g., our learner uses a bias that focuses on states where large improvement 
appears possible. 

Finally, we note that for API to be effective, it is important that the initial policy 
7To provide guidance toward improvement, i.e., ttq must bootstrap the API process. For 
example, in goal-based planning domains ttq should reach a goal from some of the sampled 
states. In Section 6 we will discuss this important issue of bootstrapping and introduce a 
new bootstrapping technique. 

4. Technical Analysis 

In this section, we consider a variant of the policy improvement step of our main API loop, 
which learns an improved policy given a base policy tt. We show how to select a sampling 
width w, horizon h, and training set size n such that, under certain assumptions, the quality 
of the learned policy is close to the quality of tt 1 the policy iteration improvement. Similar 
results have been shown for previous forms of API based on approximate value functions 
(Bertsekas & Tsitsiklis, 1996), however, our assumptions are of a much different nature. 5 

The analysis is divided into two parts. First, following Khardon (1999b), we consider 
the sample complexity of policy learning. That is, we consider how many trajectories of a 
target policy must be observed by a learner before we can guarantee a good approximation 
to the target. Second, we show how to apply this result, which is for deterministic policies, 
to the problem of learning from rollout policies, which can be stochastic. Throughout we 
assume the context of an MDP M = (S, A, T, R, I). 

4.1 Learning Deterministic Policies 

A trajectory of length h is a sequence (sq, ao, s\ , a\ , . . . , a^-i, S/J of alternating states Sj and 
actions ctj. We say that a deterministic policy tt is consistent with a trajectory (si, a\, . . . , Sh) 
if and only if for each < i < h, 7r(sj) = o«. We define D\ to be a distribution over the set 
of all length h trajectories, such that D^(t) is the probability that tt generates trajectory 
t = (so, ao, s±, oi, . . . , ah— i, Sh) according to the following process: first draw sq according 
to the initial state distribution /, and then draw Si+i from T(si,ir(si)) for < i < h. Note 
that D^(t) is non-zero only if tt is consistent with t. 

Our policy improvement step first generates trajectories of the rollout policy tt (see Sec- 
tion 3.2), via the procedure Improved- Trajectories, and then learns an approximation 

5. In particular, Bertsekas and Tsitsiklis (1996) assumes a bound on the L x norm of the value function 
approximation, i.e., that at each state the approximation is almost perfect. Rather we assume that 
the improved policy tt' comes from a finite class of policies for which we have a consistent learner. 
In both cases policy improvement can be guaranteed given an additional assumption on the minimum 
Q-advantage of the MDP (see below). 
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API (n, w, h, M, 7r , 7) 

// training set size n, sampling width w, horizon h, 

// MDP M = (S, {ai, . . . , a m }, T, R, I), initial policy tt , discount factor 7. 

TT <— 7r ; 

loop 

T <— Improved- Trajectories(n, w, h, M, tt); 
tt <— Learn-Policy(T); 

until satisfied with tt; // e.g., until change is small 

Return tt; 

Improved- Trajectories^, w, h, M, tt) 

// training set size n, sampling width w, 
// horizon h, MDP M , current policy tt 

repeat n times // generate n trajectories of improved policy 
t <— nil; 

s <— state drawn from /; // draw random initial state 
for i = 1 to h 

(Q(s, a\), . . . , Q(s, a m )) <— Policy-Rollout(7r, s, w, h, M); // Q v (s, a) estimates 
t *— t • (s,Tr(s),Q(s,ai), . . . ,Q(s,a m ))}; // concatenate new sample onto trajectory 

a <— action maximizing Q(s, a); // action of the improved policy at state s 
s <— state sampled from T(s, a); // simulate action of improved policy 
T <- TUt; 
Return T; 

Policy-Rollout (it, s, w, h, M) // Compute Q 7T (s, a) estimates (Q(s, ai), . . . , Q(s, a m )) 
// policy tt, state s, sampling width w, horizon h, MDP M 

for each action aj in A 
Q{s,ai) <- 0; 

repeat w times // Q(s,ai) is an average over w trajectories 

R <— R(s,ai); s' <— a state sampled from T(s, a^); // take action a, in s 

for i = 1 to h — 1 // take h — 1 steps 0/ 7r accumulating discounted reward in R 

R <- J R + 7 J i?(s', 7 r(s')); 

s' <— a state sampled from T(s',7r(s')) 

Q(s, ^) <— Q(s, 0^) + J?; // include trajectory in average 

Return (Q(s, ai), . . . , Q(s, a m )) 

Figure 1: Pseudo-code for our API algorithm. See Section 5.3 for an instantiation of Learn- 
Policy called Learn-Decision-List. 

of tt. Note that the rollout policy serves as a stochastic approximation of tt' = VI W the 
policy iteration improvement of tt. Thus, Improved- Trajectories can be viewed as at- 
tempting to draw trajectories from , and the learning step can be viewed as learning an 
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approximation of tt' . Imagining for the moment that we can draw trajectories from D^' , 
a fundamental question is how many trajectories are sufficient to ensure that the learned 
policy will be about as good as tt'. Khardon (1999b) studied this question for the case 
of deterministic policies in undiscounted goal-based planning domains (i.e., MDPs where 
reward is only received at goal states). Here we give a straightforward adaptation of his 
main result to our problem setting where we have general reward functions and measure 
the quality of a policy by V(tt). 

The learning-problem formulation is similar in spirit to the standard framework of Prob- 
ably Approximately Correct (PAC) learning. In particular, we will assume that the target 
policy comes from a finite class of deterministic policies H. For example, H may correspond 
to the set of policies that can be described by bounded-length decision lists. In addition, 
we assume that the learner is consistent — i.e., it returns a policy from H that is consistent 
with all of the training trajectories. Under these assumptions, a relatively small number 
of trajectories (logarithmic in \H\) are sufficient to ensure that with high probability the 
learned policy is about as good as the target. 

Proposition 1. Let H be a finite class of deterministic policies. For any tt € H , and any 

set of n = e In trajectories drawn independently from D^, there is a 1 — 5 probability 
that every tt G H consistent with the trajectories satisfies V(tt) > V(tt) — 2V rnax (e + r y h ). 

The proof of this proposition is in the Appendix. The computational complexity of 
finding a consistent policy depends on the policy class H. Polynomial-time algorithms 
can be given for interesting classes such as bounded-length decision lists — however, these 
algorithms are typically too expensive for the policy classes we consider in practice. Rather, 
as described in Section 5.3, we use a learner based on greedy heuristic search, which often 
works well in practice. 

The assumption that the target policy comes from a fixed size class H will often be 
violated. However, as pointed out by Khardon (1999b), it is straightforward to give an 
extension of Proposition 1 for the setting where the learner considers increasingly complex 
policies until a consistent one is found. In this case, the sample complexity is related to the 
encoding size of the target policy rather than the size of H, thus allowing the use of very 
large and expressive policy classes without necessarily paying the full sample-complexity 
price of Proposition 1. 

4.2 Learning from Rollout Policies 

The proof of Proposition 1 relies critically on the fact that the policy class H contains only 
deterministic policies. However, in our main API loop, the target policies are computed via 
rollout and hence are stochastic due to the uncertainty introduced by finite sampling. Thus, 
we cannot directly use Proposition 1 in the context of learning from trajectories produced 
by rollout. To deal with this problem we describe a variant of Improved- Trajectories 
that can reliably generate training trajectories from the deterministic policy tt' = VI n (see 
Equation 3), which is guaranteed to improve on tt if improvement is possible. 

Given a base policy tt, we first define A w (s) to be the set of actions that maximize 
Q n (s,a). Note that tt'(s) = min A* (s) , where the minimum is taken with respect to the ac- 
tion ordering provided by the MDP. Importantly this policy is deterministic and thus if we 
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can generate trajectories of it, then we can apply the above result to learn a close approxi- 
mation. In order to generate trajectories of it' we slightly modify Improved- Trajectories. 
The modification is introduced for analysis only, and our experiments are based on the pro- 
cedures given in Figure 1. Our modification is to replace the action maximization step of 
Improved- Trajectories (second to last statement of the for loop), which chooses the next 
action a to execute, with the following two steps 

A(A, s) <- {a' | max a QO, a) - Q(s, a') < A} 
a <— miny4(A,s) 

where Q(s, a) is the estimate of Q\{s, a) computed by policy rollout using a sampling width 
w, and A is a newly introduced parameter. 

Note that if A(A,s) = A n (s), then the selected action a will equal tt'(s). If this con- 
dition is true for every state encountered then the modified Improved- Trajectories will 
effectively generate trajectories of it'. Thus, we would like to bound the probability that 
A(A,s) / A n (s) to a small value by appropriately choosing the sampling width w, the 
horizon h, and A. Unfortunately, the choice of these parameters depends on the MDP. 
That is, given any particular parameter values, there is an MDP such that the event 
A(A,s) ^ A n (s) has a non-negligible probability at some state. For this reason we first 
define the Q-advantage A* of an MDP and show how to select appropriate parameter values 
given a lower-bound on A*. 

Given an MDP and policy it, let S' be the set of states such that s G <S" iff there are 
two actions a and a' such that Q w (s,a) ^ Q 7r (s,a'), i.e., there are actions with distinct 
Q-values. Also for each state in S' define a*(s) and a^s) be a best action and a second 
best action respectively as measured by Q n (s,a). The Q-advantage is defined as A* = 
mines'/ a\(s) — af, (s), which measures the minimum Q- value gap between an optimal and 
sub-optimal action over the state space. Given a lower-bound on the Q-advantage of an 
MDP the following proposition indicates how to select parameter values to ensure that 
A(A,s) = A n (s) with high probability. 

Proposition 2. For any MDP with Q-advantage at least A* , and any < 5' < 1, if we 

have 

h 
w 
A 

then for any state s, A(A, s) = A w (s) with probability at least 1 — 5'. 

The proof is given in the Appendix. Thus, for parameter values satisfying the above condi- 
tions, if our MDP has Q-advantage at least A* then we are guaranteed that with probability 
at least 1 — 5' that A(A, s) = A(A, s). This means that Improved-Trajectories will cor- 
rectly select the action vr'(s) with probability at least 1 — 5'. Note that this proposition 



A* 
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agrees with the intuition that both h and w should increase with decreasing Q-advantage 
and increasing V max — and also that w should increase for decreasing <5'. 6 

In order to generate n length h trajectories of ir', the modified Improved- Trajectories 
routine must compute the set A(A, •) at n ■ h states, yielding n ■ h opportunities to make an 
error. To ensure that no error is made, the modified procedure sets the sampling width w 
using 5' = ^j;. This guarantees that an error free training set is created with probability 
at least 1 — |. 

Combining this observation with the assumption that it' G H we can apply Proposition 
1 as follows. First, generate n = e^ 1 In trajectories of it' using the modified Improved- 
Trajectories routine (with 5' = )• Next, learn a policy 7r from these trajectories using 
a consistent learner. We know that the probability of generating an imperfect training set 
is bounded by |, and for the chosen value of n, the failure probability of the learner is 
also bounded by |. Thus, we get that with probability at least 1 — 5, the learned policy n 
satisfies V(tt) > V(tt') — 2V maiX (e + j h ), giving an approximation guarantee relative to the 
improved policy tt'. This is summarized by the following proposition. 

Proposition 3. Let H be a finite class of deterministic policies, < <5 < 1, and < e < 1. 
For any MDP with Q-advantage at least A* , any policy it such that VI n G H, and any set 
of n > e _1 In (2|i7|<5 _1 ) trajectories produced by modified Improved- Trajectories using 
parameters satisfying, 

A = T 

A* 

h > log 



(8V max y in 2nh\A\ 
\ A* J 5 

there is at least a 1 — 5 probability that every tt £ H consistent with the trajectories satisfies 
V(K)>V(Vr)-2V max {e + 1 h ). 

One notable aspect of this result is that there is only a logarithmic dependence on the 
number of actions |^4| and However, the practical utility is hindered by its dependence 
on A* which is typically not known in practice, and can be exponentially small in the 
planning horizon. Unfortunately, this dependence appears to be unavoidable for our type 
of approach where we try to learn from trajectories of VI n produced by rollout. This is 
because for any particular setting of the above parameters, there is always an MDP with 
a small enough Q-advantage, such that the value of the rollout policy is arbitrarily worse 
than that of VT. 

5. API for Relational Planning 

Our work is motivated by the goal of solving relational MDPs. In particular, we are inter- 
ested in finding policies for relational MDPs that represent classical planning domains and 

6. At first glance it appears that the lower-bound on h decreases with increasing Vmax and decreasing A*. 
However, the opposite is true since the base of the logarithm is the discount factor, which is strictly less 
than one. Also note that since A* is upper-bounded by 2Vhiax the bound on h will always be positive. 
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their stochastic variants. Such policies can then be applied to any problem instance from a 
planning domain, and hence can be viewed as a form of domain-specific control knowledge. 

In this section, we first describe a straightforward way to view classical planning domains 
(not just single problem instances) as relationally factored MDPs. Next, we describe our 
relational policy space in which policies are compactly represented as taxonomic decision 
lists. Finally, we present a heuristic learning algorithm for this policy space. 

5.1 Planning Domains as MDPs. 

We say that an MDP (S, A, T, R, I) is relational when S and A are defined by giving a finite 
set of objects O, a finite set of predicates P, and a finite set of action types Y. A fact is 
a predicate applied to the appropriate number of objects, e.g., on(a, b) is a blocks-world 
fact. A state is a set of facts, interpreted as representing the true facts in the state. The 
state space S contains all possible sets of facts. An action is an action type applied to the 
appropriate number of objects, e.g., putdown(a) is a blocks- world action, and the action 
space A is the set of all such actions. 

A classical planning domain describes a set of problem instances with related structure, 
where a problem instance gives an initial world state and goal. For example, the blocks 
world is a classical planning domain, where each problem instance specifies an initial block 
configuration and a set of goal conditions. Classical planners attempt to find solutions to 
specific problem instances of a domain. Rather, our goal is to solve entire planning domains 
by finding a policy that can be applied to all problem instances. As described below, it is 
straightforward to view a classical planning domain as a relational MDP where each MDP 
state corresponds to a problem instance. 

State and Action Spaces. Each classical planning domain specifies a set of action 
types Y, world predicates W, and possible world objects O. Together Y and O define the 
MDP action space. Each state of the MDP corresponds to a single problem instance (i.e., a 
world state and a goal) from the planning domain by specifying both the current world and 
the goal. We achieve this by letting the set of relational MDP predicates be P = W U G, 
where G is a set of goal predicates. The set of goal predicates contains a predicate for 
each world predicate in W, which is named by prepending a 'g' onto the corresponding 
world predicate name (e.g., the goal predicate gclear corresponds to the world predicate 
clear). With this definition of P we see that the MDP states are sets of goal and world 
facts, indicating the true world facts of a problem instance and the goal conditions. It 
is important to note, as described below, that the MDP actions will only change world 
facts and not goal facts. Thus, this large relational MDP can be viewed as a collection of 
disconnected sub-MDPs, where each sub-MDP corresponds to a distinct goal condition. 

Reward Function. Given an MDP state the objective is to reach another MDP state 
where the goal facts are a subset of the corresponding world facts — i.e., reach a world state 
that satisfies the goal. We will call such states goal states of the MDP. For example, the 
MDP state 

{on-table(a), on(a, b), clear (b), gclear (&)} 

is a goal state in a blocks-world MDP, but would not be a goal state without the world fact 
clear (b). We represent the objective of reaching a goal state quickly by defining R to assign 
a reward of zero for actions taken in goal states and negative rewards for actions in all 
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other states, representing the cost of taking those actions. Typically, for classical planning 
domains, the action costs are uniformly -1, however, our framework allows the cost to vary 
across actions. 

Transition Function. Each classical planning domain provides an action simulator 
(e.g., as defined by STRIPS rules) that, given a world state and action, returns a new world 
state. We define the MDP transition function T to be this simulator modified to treat goal 
states as terminal and to preserve without change all goal predicates in an MDP state. Since 
classical planning domains typically have a large number of actions, the action definitions 
are usually accompanied by preconditions that indicate the legal actions in a given state, 
where usually the legal actions are a small subset of all possible actions. We assume that 
T treats actions that are not legal as no-ops. For simplicity, our relational MDP definition 
does not explicitly represent action preconditions, however, we assume that our algorithms 
do have access to preconditions and thus only need to consider legal actions. For example, 
we can restrict rollout to only the legal actions in a given state. 

Initial State Distribution. Finally, the initial state distribution / can be any program 
that generates legal problem instances (MDP states) of the planning domain. For exam- 
ple, problem domains from planning competitions are commonly distributed with problem 
generators. 

With these definitions, a good policy is one that can reach goal states via low-cost 
action sequences from initial states drawn from /. Note that here policies are mappings 
from problem instances to actions and thus can be sensitive to goal conditions. In this 
way, our learned policies are able to generalize across different goals. We next describe a 
language for representing such generalized policies. 

5.2 Taxonomic Decision List Policies. 

For single argument action types, many useful rules for planning domains take the form of 
"apply action type A to any object in class C" (Martin & Geffner, 2000). For example, in the 
blocks world, "pick up any clear block that belongs on the table but is not on the table", 
or in a logistics world, "unload any object that is at its destination". Using a concept 
language for describing object classes, Martin and Geffner (2000) introduced the use of 
decision lists of such rules as a useful learning bias, showing promising experiments in the 
deterministic blocks world. With that motivation, we consider a policy space that is similar 
to the one used originally by Martin and Geffner, but generalized to handle multiple action 
arguments. Also, for historical reasons, our concept language is based upon taxonomic 
syntax (McAUester, 1991; McAllester & Givan, 1993), rather than on description logic as 
used by Martin and Geffner. 

Comparison Predicates. For relational MDPs with world and goal predicates, such 
as those corresponding to classical planning domains, it is often useful for polices to compare 
the current state with the goal. To this end, we introduce a new set of predicates, called 
comparison predicates, which are derived from the world and goal predicates. For each 
world predicate p and corresponding goal predicate gp, we introduce a new comparison 
predicate cp that is defined as the conjunction of p and gp. That is, a comparison-predicate 
fact is true if and only if both the corresponding world and goal predicates facts are true. 
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For example, in the blocks world, the comparison-predicate fact con(a, b) indicates that a 
is on b in both the current state and the goal — i.e., on(a, b) and gon(a, b) are true. 

Taxonomic Syntax. Taxonomic syntax provides a language for writing class expres- 
sions that represent sets of objects with properties of interest and serve as the fundamental 
pieces with which we build policies. Class expressions are built from the MDP predicates 
(including comparison predicates if applicable) and variables. In our policy representation, 
the variables will be used to denote action arguments, and at runtime will be instantiated 
by objects. For simplicity we only consider predicates of arity one and two, which we call 
primitive classes and relations, respectively. When a domain contains predicates of arity 
three or more, we automatically convert them to multiple auxiliary binary predicates. Given 
a list of variables X = (x±, . . . , X/.), class expressions are given by, 

C[X] ::= C | Si | a-thing | ->C[X\ | {R C[X]) | (min R) 
R y.= Rq I R ^ \ R* 

where C[X] is a class expression, R is a relation expression, Co is a primitive class, Rq is 
a primitive relation, and Xi is a variable in X. Note that, for classical planning domains, 
the primitive classes and relations can be world, goal, or comparison predicates. We de- 
fine the depth d(C[X]) of a class expression C[X] to be one if C[X] is either a primitive 
class, a-thing, a variable, or (min R), otherwise we define d{-^C[X\) and d{R C\X]) to be 
d(C[X]) + 1, where R is a relation expression and C[X] is a class expression. For a given 
relational MDP we denote by Cd[X] the set of all class expressions C[X] that have a depth 
of d or less. 

Intuitively the class expression (R C[X]) denotes the set of objects that are related 
through relation R to some object in the set C[X\. The expression (R* C[X]) denotes 
the set of objects that are related through some U R chain" to an object in C[X\ — this 
constructor is important for representing recursive concepts (e.g., the blocks above a). The 
expression (mini?) denotes the set of objects that are minimal under the relation R. 

More formally, let s be an MDP state and O = (o±, . . . , o&) be a variable assignment, 
which assigns object Oj to variable x«. The interpretation of C[X] relative to s and O is a 
set of objects and is denoted by C[X] S '° . A primitive class Co is interpreted as the set of 
objects for which the predicate symbol Co is true in s. Likewise, a primitive relation Rq is 
interpreted as the set of all object tuples for which the relation i?o holds in s. The class 
expression a-thing denotes the set of all objects in s. The class expression Xi, where Xi 
is a variable, is interpreted to be the singleton set {oj}. The interpretation of compound 
expressions is given by, 

(-C[A]) S '° = {o | o C[X] S '°} 

(R C[X]) S '° = {o | 3d G C[X] S '° s.t. (d, o) G R s '°} 

{min i?) s '° = {o | 3d s.t. (o, d) G R s '° , fid s.t. (d, o) G R s >°} 

(iT) s '° = ID U {(oi, o v ) | 3o 2 , o v -! s.t. (oi, Oi+i) G R s '° for 1 < i < v} 

( R -y,o = {(0)0 ')| (o ' )0)G ^o } 

where C[X] is a class expression, R is a relation expression, and ID is the identity relation. 
Some examples of useful blocks-world concepts, given the primitive classes clear, gclear, 
holding, and con-table, along with the primitive relations on, gon, and con, are: 
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• (gon 1 holding) has depth two, and denotes the block that we want under the block 
being held. 

• (on* (on gclear)) has depth three, and denotes the blocks currently above blocks 
that we want to make clear. 

• (con* con-table) has depth two, and denotes the set of blocks in well constructed 
towers. To see this note that a block b v is in this class if and only if there exists a 
sequence of blocks b±, . . . ,b v such that 61 is on the table in both the goal and the 
current state (i.e. con-table(6i)) and is on b{ in both the goal and current state 
(i.e. con(6j, 6j + i)) for 1 < i < v. 

• (gon (con* con-table)) has depth three, and denotes the blocks that belong on top 
of a currently well constructed tower. 

Decision List Policies We represent policies as decision lists of action-selection rules. 
Each rule has the form a(x±, . . . , x^) ■ L±, L2, ■ . ■ L m , where a is a /c-argument action type, 
the Li are literals, and the x% are action- argument variables. We will denote the list of 
action argument variables as X = (x±, . . . , X&). Each literal has the form x E C[X], where 
C[X] is a taxonomic syntax class expression and x is an action-argument variable. 

Given an MDP state s and a list of action-argument objects O = (01, . . . , we say 
that a literal Xi G C[X] is true given s and O iff o% € C[X] S, ° . We say that a rule 
R = a(x±, . . . , Xk) ■ Li, L2, . . . L m allows action a(oi, . . . Ok) in s iff each literal in the rule 
is true given s and O. Note that if there are no literals in a rule for action type a, then all 
possible actions of type a are allowed by the rule. A rule can be viewed as placing mutual 
constraints on the tuples of objects that an action type can be applied to. Note that a 
single rule may allow no actions or many actions of one type. Given a decision list of such 
rules we say that an action is allowed by the list if it is allowed by some rule in the list, 
and no previous rule allows any actions. Again, a decision list may allow no actions or 
multiple actions of one type. A decision list L for an MDP defines a deterministic policy 
7r[L] for that MDP. If L allows no actions in state s, then 7r[L](s) is the least 7 legal action 
in s; otherwise, tt[L](s) is the least legal action that is allowed by L. It is important to 
note that since ir[L] only considers legal actions, as specified by action preconditions, the 
rules do not need to encode the preconditions, which allows for simpler rules and learning. 
In other words, we can think of each rule as implicitly containing the preconditions of its 
action type. 

As an example of a taxonomic decision list policy consider a simple blocks-world domain 
where the goal condition is always to clear off all of the red blocks. The primitive classes 
in this domain are red, clear, and holding, and the single relation is on. The following 
policy will solve any problem in the domain. 

putdown(xi) : x\ £ holding 

pickup(xi) : X\ G clear, x\ € (on* (on red)) 

7. The action ordering in a relational MDP is defined lexicographically in terms of orderings on the action 
types and objects. 
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The first rule will cause the agent to putdown any block that is being held. Otherwise, if 
no block is being held, then find a block x± that is clear and is above a red block (expressed 
by (on*(on red))) and pick it up. Appendix B gives examples of more complex policies 
that are learned by our system in the experiments. 

5.3 Learning Taxonomic Decision Lists 

For a given relational MDP, define TZ^i to be the set of action-selection rules that have 
a length of at most I literals and whose class expression have depth at most d. Also, let 
H^i denote the policy space defined by decision lists whose rules are from lZd,i- Since the 
number of depth-bounded class expressions is finite there are a finite number of rules, and 
hence H^i is finite, though exponentially large. Our implementation of Learn-Policy, as 
used in the main API loop, learns a policy in H^i for user specified values of d and I. 

We use a Rivest-style decision-list learning approach (Rivest, 1987) — an approach also 
taken by Martin and Geffher (2000) for learning class-based policies. The primary difference 
between Martin and Geffher (2000) and our technique is the method for selecting individual 
rules in the decision list. We use a greedy, heuristic search, while previous work used an 
exhaustive enumeration approach. This difference allows us to find rules that are more 
complex, at the potential cost of failing to find some good simple rules that enumeration 
might discover. 

Recall from Section 3, that the training set given to Learn-Policy contains trajectories 
of the rollout policy. Our learning algorithm, however, is not sensitive to the trajectory 
structure (i.e., the order of trajectory elements) and thus, to simplify our discussion, we 
will take the input to our learner to be a training set D that contains the union of all 
the trajectory elements. This means that for a trajectory set that contains n length h 
trajectories, D will contain a total of n ■ h training examples. As described in Section 3, 
each training example in D has the form (s, tt(s), Q(s, a±), . . . , Q(s, a m )), where s is a state, 
tt(s) is the action selected in s by the previous policy, and Q(s,ai) is the Q-value estimate 
of Q w (s,ai). Note that in our experiments the training examples only contain values for 
the legal actions in a state. 

Given a training set D, a natural learning goal is to find a decision-list policy that for 
each training example selects an action with the maximum estimated Q-value. This learning 
goal, however, can be problematic in practice as often there are several best (or close to 
best) actions as measured by the true Q-function. In such case, due to random sampling, 
the particular action that looks best according to the Q-value estimates in the training set 
is arbitrary. Attempting to learn a concise policy that matches these arbitrary actions will 
be difficult at best and likely impossible. 

One approach (Lagoudakis & Parr, 2003) to avoiding this problem is to use statistical 
tests to determine the actions that are "clearly the best" (positive examples) and the ones 
that are "clearly not the best" (negative examples). The learner is then asked to find a 
policy that is consistent with the positive and negative examples. While this approach has 
shown some empirical success, it has the potential shortcoming of throwing away most of 
the Q-value information. In particular, it may not always be possible to find a policy that 
exactly matches the training data. In such cases, we would like the learner to make informed 
trade-offs regarding sub-optimal actions — i.e., prefer sub-optimal actions that have larger 
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Learn-Decision-List (D, d, I, b) 

// training set D, concept depth d, rule length I, beam width b 
L <— nil; 

while (D is not empty) 

R <— Learn-Rule(Z), d, I, b); 

D^D-{ded\R covers d}; 

L <— Extend-List(L, R); // add R to end of list 
Return L; 

Learn- Rule(Z), d, I, b) 

// training set D, concept depth d, rule length I, beam width b 
for each action type a // compute rule for each action type a 

R a <— Beam-Search(Z), d, I, b, a); 
Return argmax a Hvalue(i? a , D); 

Beam-Search (D, d, I, b, a) 

// training set D, concept depth d, rule length I, beam width b, action type a 

k <— arity of a; X <— (xi, . . . , // X is a sequence of action- argument variables 

L^{(ieC) \ x e X,C £ Cd[X]}; // construct the set of depth bounded candidate literals 

B <— { a(X) : nil }; i <— 1; // initialize beam to a single rule with no literals 

loop 

G = U{R e Tl d .i | R = Add-Literal(i?',/),ii' £ B^J £ L}; 
Bi <— Beam-Select(G, 6, D); // select best b heuristic values 

until £?j_i = Bf, // loop until there is no more improvement in heuristic 
Return argmax i?:eB .Hvalue(_R, D) // return best rule in final beam 

Figure 2: Pseudo-code for learning a decision list in H^i given training data D. The 
procedure Add-Literal(i?, I) simply returns a rule where literal I is added to the end of 
rule R. The procedure Beam-Select(G, w, D) selects the best b rules in G with different 
heuristic values. The procedure Hvalue(i2, D) returns the heuristic value of rule R relative 
to training data D and is described in the text. 

Q-values. With this motivation, below we describe a cost-sensitive decision-list learner that 
is sensitive to the full set of Q-values in D. The learning goal is roughly to find a decision 
list that selects actions with large cumulative Q- value over the training set. 

Learning List of Rules. We say that a decision list L covers a training example 
(s, 7r(s), Q(s, di), . . . , Q(s, a m )) if L suggests an action in state s. Given a set of training 
examples D, we search for a decision list that selects actions with high Q-value via an 
iterative set-covering approach carried out by Learn-Decision-List. Decision-list rules 
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are constructed one at a time and in order until the list covers all of the training examples. 
Pseudo-code for our algorithm is given in Figure 2. Initially, the decision list is the null list 
and does not cover any training examples. During each iteration, we search for a high quality 
rule R with quality measured relative to the set of currently uncovered training examples. 
The selected rule is appended to the current decision-list, and the training examples newly 
covered by the selected rule are removed from the training set. This process repeats until 
the list covers all of the training examples. The success of this approach depends heavily 
on the function Learn-Rule, which selects a good rule relative to the uncovered training 
examples — typically a good rule is one that selects actions with the best (or close to best) 
Q-value and also covers a significant number of examples. 

Learning Individual Rules. The input to the rule learner Learn-Rule is a set of 
training examples, along with depth and length parameters d and I, and a beam width b. 
For each action type a, the rule learner calls the routine Beam-Search to find a good rule 
R a in lZd,i for action type a. Learn-Rule then returns the rule R a with the highest value 
as measured by our heuristic, which is described later in this section. 

For a given action type a, the procedure Beam-Search generates a beam Bq,B\ . . ., 
where each Bi is a set of rules in TZ^i for action type a. The sets evolve by specializing 
rules in previous sets by adding literals to them, guided by our heuristic function. Search 
begins with the most general rule a(X) : nil, which allows any action of type a in any state. 
Search iteration i produces a set Bi that contains b rules with the highest different heuristic 
values among those in the following set 8 

G = Bi_i U{Re K dt i | R = Add-Literal(i?', I), R! e B^l e L} 

where L is the set of all possible literals with a depth of d or less. This set includes the 
current best rules (those in -Bj_i) and also any rule in lZd,l that can be formed by adding 
a new literal to a rule in Bi—\. The search ends when no improvement in heuristic value 
occurs, that is when Bi = Bi-\. Beam-Search then returns the best rule in Bi according 
to the heuristic. 

Heuristic Function. For a training instance (s, tt(s), Q(s, a±), . . . , Q(s, a m )), we de- 
fine the Q-advantage of taking action ctj instead of 7r(s) in state s by A(s,dj) = Q(s,ai) — 
Q(s, 7r(s)). Likewise, the Q-advantage of a rule R is the sum of the Q-advantages of actions 
allowed by R in s. Given a rule R and a set of training examples D, our heuristic function 
Hvalue(i?, D) is equal to the number of training examples that the rule covers plus the 
cumulative Q-advantage of the rule over the training examples. 9 Using Q-advantage rather 
than Q-value focuses the learner toward instances where large improvement over the previ- 
ous policy is possible. Naturally, one could consider using different weights for the coverage 
and Q-advantage terms, possibly tuning the weight automatically using validation data. 

8. Since many rules in TZd,l are equivalent, we must prevent the beam from filling up with semantically 
equivalent rules. Rather than deal with this problem via expensive equivalence testing we take an ad-hoc, 
but practically effective approach. We assume that rules do not coincidentally have the same heuristic 
value, so that ones that do must be equivalent. Thus, we construct beams whose members all have 
different heuristic values. We choose between rules with the same value by preferring shorter rules, then 
arbitrarily. 

9. If the coverage term is not included, then covering a zero Q-advantage example is the same as not 
covering it. But zero Q-advantage can be good (e.g., the previous policy is optimal in that state). 
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6. Random Walk Bootstrapping 

There are two issues that are critical to the success of our API technique. First, API is 
fundamentally limited by the expressiveness of the policy language and the strength of the 
learner, which dictates its ability to capture the improved policy described by the training 
data at each iteration. Second, API can only yield improvement if Improved- Trajectories 
successfully generates training data that describes an improved policy. For large classical 
planning domains, initializing API with an uninformed random policy will typically result 
in essentially random training data, which is not helpful for policy improvement. For 
example, consider the MDP corresponding to the 20-block blocks world with an initial 
problem distribution that generates random initial and goal states. In this case, a random 
policy is unlikely to reach a goal state within any practical horizon time. Hence, the 
rollout trajectories are unlikely to reach the goal, providing no guidance toward learning an 
improved policy (i.e., a policy that can more reliably reach the goal). 

Because we are interested in solving large domains such as this, providing guiding inputs 
to API is critical. In Fern, Yoon, and Givan (2003), we showed that by bootstrapping API 
with the domain-independent heuristic of the planner FF (Hoffmann & Nebel, 2001), API 
was able to uncover good policies for the blocks world, simplified logistics world (no planes), 
and stochastic variants. This approach, however, is limited by the heuristic's ability to 
provide useful guidance, which can vary widely across domains. 

Here we describe a new bootstrapping procedure for goal-based planning domains, based 
on random walks, for guiding API toward good policies. Our planning system, which is 
evaluated in Section 7, is based on integrating this procedure with API in order to find 
policies for goal-based planning domains. For non-goal-based MDPs, this bootstrapping 
procedure can not be directly applied, and other bootstrapping mechanisms must be used 
if necessary. This might include providing an initial non-trivial policy, providing a heuristic 
function, or some form of reward shaping (Mataric, 1994). Below, we first describe the 
idea of random- walk distributions. Next, we describe how to use these distributions in the 
context of bootstrapping API, giving a new algorithm LRW-API. 

6.1 Random Walk Distributions 

Throughout we consider an MDP M = (S, A, T, R, I) that correspond to goal-based plan- 
ning domains, as described in Section 5.1. Recall that each state s £ 5 corresponds to a 
planning problem, specifying a world state (via world facts) and a set of goal conditions (via 
goal facts). We will use the terms "MDP state" and "planning problem" interchangeably. 
Note that, in this context, I is a distribution over planning problems. For convenience we 
will denote MDP states as tuples s = (w,g), where w and g are the sets of world facts and 
goal facts in s respectively. 

Given an MDP state s = (w,g) and set of goal predicates G, we define s\g to be the 
MDP state (w, g') where g' contains those goal facts in g that are applications of a predicate 
in G. Given M and a set of goal predicates G, we define the n-step random-walk problem 
distribution lZW n (M,G) by the following stochastic algorithm: 

1. Draw a random state sq = (wo,go) from the initial state distribution I. 
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2. Starting at sq take n uniformly random actions 10 , giving a state sequence (so, . . . , s n ), 
where s n = (w n ,go) (recall that actions do not change goal facts). At each uniformly 
random action selection, we assume that an extra "no-op" action (that does not change 
the state) is selected with some fixed probability, for reasons explained below. 

3. Let g be the set of goal facts corresponding to the world facts in w n , so e.g., if 
w n = {on(a, b), clear(a)}, then g = {gon(a, b), gclear(a)}. Return the planning 
problem (MDP state) (so,<?)|g as the output. 

We will sometimes abbreviate lZW n (M, G) by lZW n when M and G are clear in context. 

Intuitively, to perform well on this distribution a policy must be able to achieve facts 
involving the goal predicates that typically result after an n-step random walk from an 
initial state. By restricting the set of goal predicates G we can specify the types of facts 
that we are interested in achieving — e.g., in the blocks world we may only be interested in 
achieving facts involving the "on" predicate. 

The random-walk distributions provide a natural way to span a range of problem diffi- 
culties. Since longer random walks tend to take us further from an initial state, for small 
n we typically expect that the planning problems generated by 1ZW n will become more 
difficult as n grows. However, as n becomes large, the problems generated will require far 
fewer than n steps to solve — i.e., there will be more direct paths from an initial state to the 
end state of a long random walk. Eventually, since S is finite, the problem difficulty will 
stop increasing with n. 

A question raised by this idea is whether, for large n, good performance on lZW n 
ensures good performance on other problem distributions of interest in the domain. In 
some domains, such as the simple blocks world 11 , good random- walk performance does 
seem to yield good performance on other distributions of interest. In other domains, such 
as the grid world (with keys and locked doors), intuitively, a random walk is very unlikely 
to uncover a problem that requires unlocking a sequence of doors. Indeed, since 1ZW n is 
insensitive to the goal distribution of the underlying planning domain, the random-walk 
distribution may be quite different. 

We believe that good performance on long random walks is often useful, but is only 
addressing one component of the difficulty of many planning benchmarks. To successfully 
address problems with other components of difficulty, a planner will need to deploy orthog- 
onal technology such as landmark extraction for setting subgoals (Hoffman, Porteous, & 
Sebastia, 2004). For example, in the grid world, if we could automatically set the subgoal 
of possessing a key for the first door, a long random-walk policy could provide a useful 
macro for getting that key. 

For the purpose of developing a bootstrapping technique for API, we limit our focus 
to finding good policies for long random walks. In our experiments, we define "long" by 
specifying a large walk length N. Theoretically, the inclusion of the "no-op" action in the 
definition of 1ZW ensures that the induced random-walk Markov chain 12 is aperiodic, and 

10. In practice, we only select random actions from the set of applicable actions in a state s», provided our 
simulator makes it possible to identify this set. 

11. In the blocks world with large n, lZW n generates various pairs of random block configurations, typically 
pairing states that are far apart — clearly, a policy that performs well on this distribution has captured 
significant information about the blocks world. 

12. We don't formalize this chain here, but various formalizations work well. 
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thus that the distribution over states reached by increasingly long random walks converges 
to a stationary distribution 13 . Thus 1ZW* = linin^oo 7£W n is well-defined, and we take 
good performance on 7ZW* to be our goal. 

6.2 Random- Walk Bootstrapping 

For an MDP M, we define M[/'] to be an MDP identical to M only with the initial state 
distribution replaced by I'. We also define the success ratio SR(n, M[I}) of tt on M[I] as 
the probability that tt solves a problem drawn from /. Also treating / as a random variable, 
the average length AL(7r, M[/]) of tt on M[I] is the conditional expectation of the solution 
length of tt on problems drawn from / given that tt solves I. Typically the solution length of 
a problem is taken to be the number of actions, however, when action costs are not uniform, 
the length is taken to be the sum of the action costs. Note that for the MDP formulation 
of classical planning domains, given in Section 5.1, if a policy tt achieves a high V(ir) then 
it will also have a high success ratio and low average cost. 

Given an MDP M and set of goal predicates G, our system attempts to find a good 
policy for M[1ZWn], where N is selected to be large enough to adequately approximate 
while still allowing tractable completion of the learning. Naively, given an initial 
random policy ttq, we could try to apply API directly. However, as already discussed, this 
will not work in general, since we are interested in planning domains where TZW* produces 
extremely large and difficult problems where random policies provide an ineffective starting 
point. 

However, for very small n (e.g., n = 1), 7ZW n typically generates easy problems, and 
it is likely that API, starting with even a random initial policy, can reliably find a good 
policy for 1ZW n . Furthermore, we expect that if a policy TT n performs well on lZW n , then 
it will also provide reasonably good, but perhaps not perfect, guidance on problems drawn 
from TZW m when m is only moderately larger than n. Thus, we expect to be able to find a 
good policy for 1ZW m by bootstrapping API with initial policy 7r n . This suggests a natural 
iterative bootstrapping technique to find a good policy for large n (in particular, for n = N). 

Figure 3 gives pseudo-code for the procedure LRW-API which integrates API and 
random-walk bootstrapping to find a policy for the long-random-walk problem distribution. 
Intuitively, this algorithm can be viewed as iterating through two stages: first, finding a 
hard enough distribution for the current policy (by increasing n); and, then, finding a good 
policy for the hard distribution using API. The algorithm maintains a current policy tt 
and current walk length n (initially n = 1). As long as the success ratio of tt on RW n is 
below the success threshold r, which is a constant close to one, we simply iterate steps of 
approximate policy improvement. Once we achieve a success ratio of r with some policy tt, 
the if-statement increases n until the success ratio of tt on TZW n falls below r — 5. That is, 
when tt performs well enough on the current n-step distribution we move on to a distribution 
that is slightly harder. The constant 5 determines how much harder and is set small enough 
so that tt can likely be used to bootstrap policy improvement on the harder distribution. 
(The simpler method of just increasing n by 1 whenever success ratio r is achieved will also 



13. The Markov chain may not be irreducible, so the same stationary distribution may not be reached from 
all initial states; however, we are only considering one initial state, described by /. 
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LRW-API (N, G, n, w, h, M, vr , 7) 

// max random-walk length N, goal predicates G 
// training set size n, sampling width w, horizon h, 
// MDP M , initial policy ttq, discount factor 7. 

7r <- 7r ; rn-1; 
loop 

if SR^(n) > r 

// Find harder n-step distribution for ir. 

n <— least i € [n, N] s.t. SR^(i) < r — 6, or N if none; 

M' = M[RW n (M, G)]; 

T <— Improved- Trajectories(n, w, h, M', tt); 
it Learn-Policy(T); 
until satisfied with it 

Return tt; 



Figure 3: Pseudo-code for LRW-API. SR^n) estimates the success ratio of it in planning 
domain D on problems drawn from 1ZW n (M, G) by drawing a set of problems and returning 
the fraction solved by tt. Constants r and 5 are described in the text. 

find good policies whenever this method does. This can take much longer, as it may run 
API repeatedly on a training set for which we already have a good policy.) 

Once n becomes equal to the maximum walk length N, we will have n = N for all future 
iterations. It is important to note that even after we find a policy with a good success ratio 
on IZWn it may still be possible to improve on the average length of the policy. Thus, 
we continue API on this distribution until we are satisfied with both the success ratio and 
average length of the current policy. 

7. Relational Planning Experiments 

In this section, we evaluate the LRW-API technique on relational MDPs corresponding to 
deterministic and stochastic classical planning domains. We first give results for a number of 
deterministic benchmark domains, showing promising results in comparison with the state- 
of-the-art planner FF (Hoffmann & Nebel, 2001), while also highlighting limitations of our 
approach. Next, we give results for several stochastic planning domains including those 
in the domain-specific track of the 2004 International Probabilistic Planning Competition 
(IPPC). All of the domain definitions and problem generators used in our experiments are 
available upon request. 

In all of our experiments, we use the policy learner described in Section 5.3 to learn 
taxonomic decision list policies. In all cases, the number of training trajectories is 100, and 
policies are restricted to rules with a depth bound d and length bound I. The discount 
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factor 7 was always one, and LRW-API was always initialized with a policy that selects 
random actions. We utilize a maximum-walk-length parameter N = 10, 000 and set r and 
5 equal to 0.9 and 0.1 respectively. 

7.1 Deterministic Planning Experiments 

We perform experiments in seven familiar STRIPS planning domains including those used 
in the AIPS-2000 planning competition, those used to evaluate TL-Plan in Bacchus and 
Kabanza (2000), and the Gripper domain. Each domain has a standard problem generator 
that accepts parameters, which control the size and difficulty of the randomly generated 
problems. Below we list each domain and the parameters associated with them. A detailed 
description of these domains can be found in Hoffmann and Nebel (2001). 

• Blocks World (n) : the standard blocks worlds with n blocks. 

• Freecell (s, c, /, I) : a version of Solitaire with s suits, c cards per suit, / freecells, and 
I columns. 

• Logistics (a,c,l,p) : the logistics transportation domain with a airplanes, c cities, I 
locations, and p packages. 

• Schedule (p) : a job shop scheduling domain with p parts. 

• Elevator (/, p) : elevator scheduling with / floors and p people. 

• Gripper (6) : a robotic gripper domain with b balls. 

• Briefcase (i) : a transportation domain with i items. 

LRW Experiments. Our first set of experiments evaluates the ability of LRW-API 
to find good policies for 1ZW*. Here we utilize a sampling width of one for rollout, since 
these are deterministic domains. Recall that in each iteration of LRW-API we compute an 
(approximately) improved policy and may also increase the walk length n to find a harder 
problem distribution. We continued iterating LRW-API until we observed no further 
improvement. The training time per iteration is approximately five hours. 14 Though the 
initial training period is significant, once a policy is learned it can be used to solve new 
problems very quickly, terminating in seconds with a solution when one is found, even for 
very large problems. 



Figure 4 provides data for each iteration of LRW-API in each of the seven domains 
with the indicated parameter settings. The first column, for each domain, indicates the 
iteration number (e.g., the Blocks World was run for 8 iterations). The second column 
records the walk length n used for learning in the corresponding iteration. The third and 
fourth columns record the SR and AL of the policy learned at the corresponding iteration 

14. This timing information is for a relatively unoptimized Scheme implementation. A reimplementation in 
C would likely result in a 5-10 fold speed-up. 
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Figure 4: Results for each iteration of LRW- API in seven deterministic planning domains. 
For each iteration, we show the walk length n used for learning, along with the success ratio 
(SR) and average length (AL) of the learned policy on both lZW n and 1ZW*. The final 
policy shown in each domain performs above r = 0.9 SR on walks of length N = 10, 000 
(with the exception of Logistics), and further iteration does not improve the performance. 
For each benchmark we also show the SR and AL of the planner FF on problems drawn 
from KW*. 



as measured on 100 problems drawn from lZW n for the corresponding value of n (i.e., 
the distribution used for learning). When this SR exceeds r, the next iteration seeks an 
increased walk length n. The fifth and sixth columns record the SR and AL of the same 
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policy, but measured on 100 problems drawn from the LRW target distribution 1ZW*, which 
in these experiments is approximated by 1Z Wn f° r N = 10, 000. 

So, for example, we see that in the Blocks World there are a total of 8 iterations, where 
we learn at first for one iteration with n = 4, one more iteration with n = 14, four iterations 
with n = 54, and then two iterations with n = 334. At this point we see that the resulting 
policy performs well on 1ZW*. Further iterations with n = N, not shown, showed no 
improvement over the policy found after iteration eight. In other domains, we also observed 
no improvement after iterating with n = N, and thus do not show those iterations. We 
note that all domains except Logistics (see below) achieve policies with good performance 
on TZWn by learning on much shorter 1ZW n distributions, indicating that we have indeed 
selected a large enough value of N to capture 1ZW*, as desired. 

General Observations. For several domains, our learner bootstraps very quickly 
from short random- walk problems, finding a policy that works well even for much longer 
random-walk problems. These include Schedule, Briefcase, Gripper, and Elevator. Typi- 
cally, large problems in these domains have many somewhat independent subproblems with 
short solutions, so that short random walks can generate instances of all the different typical 
subproblems. In each of these domains, our best LRW policy is found in a small number 
of iterations and performs comparably to FF on 1ZW*. We note that FF is considered a 
very good domain-independent planner for these domains, so we consider this a successful 
result. 

For two domains, Logistics 15 and Freecell, our planner is unable to find a policy with 
success ratio one on 1ZW* . We believe that this is a result of the limited knowledge represen- 
tation we allowed for policies for the following reasons. First, we ourselves cannot write good 
policies for these domains within our current policy language. For example, in logistics, one 
of the important concept is "the set containing all packages on trucks such that the truck is 
in the packages goal city" . However, the domain is defined in such a way that this concept 
cannot be expressed within the language used in our experiments. Second, the final learned 
decision lists for Logistics and Freecell, which are in Appendix B, contain a much larger 
number of more specific rules than the lists learned in the other domains. This indicates 
that the learner has difficulty finding general rules, within the language restrictions, that 
are applicable to large portions of training data, resulting in poor generalization. Third, 
the success ratio (not shown) for the sampling-based rollout policy, i.e., the improved policy 
simulated by Improved- Trajectories, is substantially higher than that for the resulting 
learned policy that becomes the policy of the next iteration. This indicates that Learn- 
Decision-List is learning a much weaker policy than the sampling-based policy generating 
its training data, indicating a weakness in either the policy language or the learning algo- 
rithm. For example, in the logistics domain, at iteration eight, the training data for learning 
the iteration-nine policy is generated by a sampling rollout policy that achieves success ratio 
0.97 on 100 training problems drawn from the same TZW^ distribution, but the learned 
iteration-nine policy only achieves success ratio 0.70, as shown in the figure at iteration 
nine. Extending our policy language to incorporate the expressiveness that appears to be 
required in these domains will require a more sophisticated learning algorithm, which is a 
point of future work. 

15. In Logistics, the planner generates a long sequence of policies with similar, oscillating success ratio that 
are elided from the table with an ellipsis for space reasons. 
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Figure 5: Results on standard problem distributions for seven benchmarks. Success ratio 
(SR) and average length (AL) are provided for both FF and our policy learned for the LRW 
problem distribution. For a given domain, the same learned LRW policy is used for each 
problem size shown. 

In the remaining domain, the Blocks World, the bootstrapping provided by increasingly 
long random walks appears particularly useful. The policies learned at each of the walk 
lengths 4, 14, 54, and 334 are increasingly effective on the target LRW distribution 1ZW*. 
For walks of length 54 and 334, it takes multiple iterations to master the provided level of 
difficulty beyond the previous walk length. Finally, upon mastering walk length 334, the 
resulting policy appears to perform well for any walk length. The learned policy is modestly 
superior to FF on in success ratio and average length. 

Evaluation on the Original Problem Distributions. In each domain we denote 
by 7r* the best learned LRW policy — i.e., the policy, from each domain, with the highest 
performance on 1ZW*, as shown in Figure 4. The taxonomic decision lists corresponding 
to 7r* for each domain is given in Appendix B. Figure 5 shows the performance of ir*, in 
comparison to FF, on the original intended problem distributions for each of our domains. 
We measured the success ratio of both systems by giving a time limit of 100 seconds to solve 
a problem. Here we have attempted to select the largest problem sizes previously used in 
evaluation of domain-specific planners, either in AIPS-2000 or in Bacchus and Kabanza 
(2000), as well as show a smaller problem size for those cases where one of the planners 
we show performed poorly on the large size. In each case, we use the problem generators 
provided with the domains, and evaluate on 100 problems of each size. 

Overall, these results indicate that our learned, reactive policies are competitive with 
the domain-independent planner FF. It is important to remember that these policies are 
learned in a domain-independent fashion, and thus LRW- API can be viewed as a general 
approach to generating domain-specific reactive planners. On two domains, Blocks World 
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and Briefcase, our learned policies substantially outperform FF on success ratio, especially 
on large domain sizes. On three domains, Elevator, Schedule, and Gripper, the two ap- 
proaches perform quite similarly on success ratio, with our approach superior in average 
length on Schedule but FF superior in average length on Elevator. 

On two domains, Logistics and Freecell, FF substantially outperforms our learned poli- 
cies on success ratio. We believe that this is partly due to an inadequate policy language, 
as discussed above. We also believe, however, that another reason for the poor performance 
is that the long-random-walk distribution TZW* does not correspond well to the standard 
problem distributions. This seems to be particularly true for Freecell. The policy learned 
for Freecell (4,2,2,4) achieved a success ratio of 93 percent on 1ZW*, however, for the stan- 
dard distribution it only achieved 36 percent. This suggests that 1ZW* generates problems 
that are significantly easier than the standard distribution. This is supported by the fact 
that the solutions produced by FF on the standard distribution are on average twice as long 
as those produced on 7ZW*. One likely reason for this is that it is easy for random walks to 
end up in dead states in Freecell, where no actions are applicable. Thus the random walk 
distribution will typically produce many problems where the goals correspond to such dead 
states. The standard distribution on the other hand will not treat such dead states as goals. 

7.2 Probabilistic Planning Experiments 

Here we present experiments in three probabilistic domains that are described in the prob- 
abilistic planning domain language PPDDL (Younes, 2003). 

• Ground Logistics (c, p) : a probabilistic version of logistics with no airplanes, with c 
cities and p packages. The driving action has a probability of failure in this domain. 

• Colored Blocks World (n) : a probabilistic blocks world with n colored blocks, where 
goals involve constructing towers with certain color patterns. There is a probability 
that moved blocks fall to the floor. 

• Boxworld (c, p) : a probabilistic version of full logistics with c cities and p packages. 
Transportation actions have a probability of going in the wrong direction. 

The Ground Logistics domain is originally from Boutilier et al. (2001), and was also used 
for evaluation in Yoon et al. (2002). The Colored Blocks World and Boxworld domains are 
the domains used in the hand-tailored track of IPPC in which our LRW-API technique was 
entered. In the hand-tailored track, participants were provided with problem generators for 
each domain before the competition and were allowed to incorporate domain knowledge into 
the planner for use at competition time. We provided the problem generators to LRW-API 
and learned policies for these domains, which were then entered into the competition. 

We have also conducted experiments in the other probabilistic domains from Yoon et al. 
(2002), including variants of the blocks world and a variant of Ground Logistics, some of 
which appeared in Fern et al. (2003). However, we do not show those results here since they 
are qualitatively identical to the deterministic blocks world results described above and the 
Ground Logistics results we show below. 

For our three probabilistic domains, we conducted LRW experiments using the same 
procedure as above. All parameters given to LRW-API were the same as above except 
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Figure 6: Results for each iteration of LRW-API in three probabilistic planning domains. 
For each iteration, we show the walk length n used for learning, along with the success ratio 
(SR) and average length (AL) of the learned policy on both 1ZW n and 1ZW*. For each 
benchmark, we show performance on the standard problem distribution of the policy whose 
performance is best on 7ZW*. 



that the sampling width used for rollout was set to w = 10, and r was set to 0.85 in order to 
account for the stochasticity in these domains. The results of these experiments are shown 
in Figure 6. These tables have the same form as Figure 4 only the last row given for each 
domain now gives the performance of 7r* on the standard distribution, i.e., problems drawn 
from the domains problem generator. For Colored Blocks World the problem generator 
produces problems whose goals are specified using existential quantifiers. For example, a 
simple goal may be "there exists blocks x and y such that x is red, y is blue and x is on y" . 
Since our policy language cannot directly handle existentially quantified goals we preprocess 
the planning problems produced by the problem generator to remove them. This was done 
by assigning particular block names to the existential variables, ensuring that the static 
properties of a block (in this case color) satisfied the static properties of the variable is 
was assigned to. In this domain, finding such an assignment was trivial, and the resulting 
assignment was taken to be the goal, giving a planning problem to which our learned policy 
was applied. Since the blocks world states are fully connected, the resulting goal is always 
guaranteed to be achievable. 



For Boxworld, LRW-API is not able to find a good policy for 1ZW* or the standard 
distribution. Again, as for deterministic Logistics and Freecell, we believe that this is 
primarily because of the restricted policy languages that is currently used by our learner. 
Here, as for those domains, we see that the decision list learned for Boxworld contains many 
very specific rules, indicating that the learner was not able to generalize well beyond the 
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training trajectories. For Ground Logistics, we see that LRW-API quickly finds a good 
policy for both and the standard distribution. 

For Colored Blocks World, we also see that LRW-API is able to quickly find a good 
policy for both and the standard distribution. However, unlike the deterministic 

(uncolored) blocks world, here the success ratio is observed to be less than one, solving 95 
percent of the problems. It is unclear, why LRW-API is not able to find a "perfect" policy. 
It is relatively easy to hand-code a policy for Colored Blocks World using the language of the 
learner, hence inadequate knowledge representation is not the answer. The predicates and 
action types for this domain are not the same as those in its deterministic counterpart and 
other stochastic variants that we have previously considered. This difference apparently 
interacts badly with our learners search bias, causing it to fail to find a perfect policy. 
Nevertheless, these two results, along with the probabilistic planning results not shown 
here, indicate that when a good policy is expressible in our language, LRW-API can 
find good policies in complex relational MDPs. This makes LRW-API one of the few 
techniques that can simultaneously cope with the complexity resulting from stochasticity 
and from relational structure in domains such as these. 

8. Related Work 

Boutilier et al. (2001) presented the first exact solution technique for relational MDPs 
based on structured dynamic programming. However, a practical implementation of the 
approach was not provided, primarily due to the need for the simplification of first-order 
logic formulas. These ideas, however, served as the basis for a logic-programming-based 
system (Kersting, Van Otterlo, & DeRaedt, 2004) that was successfully applied to blocks- 
world problems involving simple goals and a simplified logistics world. This style of approach 
is inherently limited to domains where the exact value functions and/or policies can be 
compactly represented in the chosen knowledge representation. Unfortunately, this is not 
generally the case for the types of domains that we consider here, particularly as the planning 
horizon grows. Nevertheless, providing techniques such as these that directly reason about 
the MDP model is an important direction. Note that our API approach essentially ignores 
the underlying MDP model, and simply interacts with the MDP simulator as a black box. 

An interesting research direction is to consider principled approximations of these tech- 
niques that can discover good policies in more difficult domains. This has been considered 
by Guestrin et al. (2003a), where a class-based MDP and value function representation was 
used to compute an approximate value function that could generalize across different sets 
of objects. Promising empirical results were shown in a multi-agent tactical battle domain. 
Presently the class-based representation does not support some of the representation fea- 
tures that are commonly found in classical planning domains (e.g., relational facts such as 
on(a, b) that change over time), and thus is not directly applicable in these contexts. How- 
ever, extending this work to richer representations is an interesting direction. Its ability to 
"reason globally" about a domain may give it some advantages compared to API. 

Our approach is closely related to work in relational reinforcement learning (RRL) (Dze- 
roski et al., 2001), a form of online API that learns relational value- function approxima- 
tions. Q-value functions are learned in the form of relational decision trees (Q-trees) and 
are used to learn corresponding policies (P-trees) . The RRL results clearly demonstrate the 
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difficulty of learning value- function approximations in relational domains. Compared to P- 
trees, Q-trees tend to generalize poorly and be much larger. RRL has not yet demonstrated 
scalability to problems as complex as those considered here — previous RRL blocks-world 
experiments include relatively simple goals 16 , which lead to value functions that are much 
less complex than the ones here. For this reason, we suspect that RRL would have difficulty 
in the domains we consider, precisely because of the value-function approximation step that 
we avoid; however, this needs to be experimentally tested. 

We note, however, that our API approach has the advantage of using an unconstrained 
simulator, whereas RRL learns from irreversible world experience (pure RL). By using 
a simulator, we are able to estimate the Q-values for all actions at each training state, 
providing us with rich training data. Without such a simulator, RRL is not able to directly 
estimate the Q-value for each action in each training state — thus, RRL learns a Q-tree to 
provide estimates of the Q- value information needed to learn the P-tree. In this way, value- 
function learning serves a more critical role when a simulator is unavailable. We believe, 
that in many relational planning problems, it is possible to learn a model or simulator 
from world experience — in this case, our API approach can be incorporated as the planning 
component of RRL. Otherwise, finding ways to either avoid learning or to more effectively 
learn relational value-functions in RRL is an interesting research direction. 

Researchers in classical planning have long studied techniques for learning to improve 
planning performance. For a collection and survey of work on "learning for planning do- 
mains" see Minton (1993) and Zimmerman and Kambhampati (2003). Two primary ap- 
proaches are to learn domain-specific control rules for guiding search-based planners e.g., 
Minton, Carbonell, Knoblock, Kuokka, Etzioni, and Gil (1989), Veloso, Carbonell, Perez, 
Borrajo, Fink, and Blythe (1995), Estlin and Mooney (1996), Huang, Selman, and Kautz 
(2000), Ambite, Knoblock, and Minton (2000), Aler, Borrajo, and Isasi (2002), and, more 
closely related, to learn domain-specific reactive control policies (Khardon, 1999a; Martin 
& Geffner, 2000; Yoon et al., 2002). 

Regarding the latter, our work is novel in using API to iteratively improve stand-alone 
control policies. Regarding the former, in theory, search-based planners can be iteratively 
improved by continually adding newly learned control knowledge — however, it can be diffi- 
cult to avoid the utility problem (Minton, 1988), i.e., being swamped by low utility rules. 
Critically, our policy-language bias confronts this issue by preferring simpler policies. Our 
learning approach is also not tied to having a base planner (let alone tied to a single partic- 
ular base planner), unlike most previous work. Rather, we only require a domain simulator. 

The ultimate goal of such systems is to allow for planning in large, difficult problems 
that are beyond the reach of domain-independent planning technology. Clearly, learning 
to achieve this goal requires some form of bootstrapping and almost all previous systems 
have relied on the human for this purpose. By far, the most common human-bootstrapping 
approach is "learning from small problems". Here, the human provides a small problem 
distribution to the learner, by limiting the number of objects (e.g., using 2-5 blocks in the 
blocks world), and control knowledge is learned for the small problems. For this approach to 
work, the human must ensure that the small distribution is such that good control knowledge 
for the small problems is also good for the large target distribution. In contrast, our long- 

16. The most complex blocks-world goal for RRL was to achieve on(^4, B) in an n block environment. We 
consider blocks-world goals that involve all n blocks. 
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random-walk bootstrapping approach can be applied without human assistance directly to 
large planning domains. However, as already pointed out, our goal of performing well on 
the LRW distribution may not always correspond well with a particular target problem 
distribution. 

Our bootstrapping approach is similar in spirit to the bootstrapping framework of "learn- 
ing from exercises" (Natarajan, 1989; Reddy & Tadepalli, 1997). Here, the learner is pro- 
vided with planning problems, or exercises, in order of increasing difficulty. After learning 
on easier problems, the learner is able to use its new knowledge, or skills, in order to boot- 
strap learning on the harder problems. This work, however, has previously relied on a 
human to provide the exercises, which typically requires insight into the planning domain 
and the underlying form of control knowledge and planner. Our work can be viewed as an 
automatic instantiation of "learning from exercises" , specifically designed for learning LRW 
policies. 

Our random-walk bootstrapping is most similar to the approach used in MlCRO-HlLLARY 
(Finkelstein & Markovitch, 1998), a macro-learning system for problem solving. In that 
work, instead of generating problems via random walks starting at an initial state, random 
walks were generated backward from goal states. This approach assumes that actions are 
invertible or that we are given a set of "backward actions". When such assumptions hold, 
the backward random-walk approach may be preferable when we are provided with a goal 
distribution that does not match well with the goals generated by forward random walks. 
Of course, in other cases forward random walks may be preferable. Micro-Hillary was 
empirically tested in the N x N sliding-puzzle domain; however, as discussed in that work, 
there remain challenges for applying the system to more complex domains with parameter- 
ized actions and recursive structure, such as familiar STRIPS domains. To the best of our 
knowledge, the idea of learning from random walks has not been previously explored in the 
context of STRIPS planning domains. 

The idea of searching for a good policy directly in policy space rather than value-function 
space is a primary motivation for policy-gradient RL algorithms. However, these algorithms 
have been largely explored in the context of parametric policy spaces. While this approach 
has demonstrated impressive success in a number of domains, it appears difficult to define 
such policy spaces for the types of planning problem considered here. 

Our API approach can be viewed as a type of reduction from planning or reinforcement 
learning to classification learning. That is, we solve an MDP by generating and solving 
a series of cost-sensitive classification problems. Recently, there have been several other 
proposals for reducing reinforcement learning to classification. Dietterich and Wang (2001) 
proposed a reinforcement learning approach based on batch value function approximation. 
One of the proposed approximations enforced only that the learned approximation assign 
the best action the highest value, which is a type of classifier learning. Lagoudakis and Parr 
(2003) proposed a classification-based API approach that is closely related to ours. The pri- 
mary difference is the form of the classification problem produced on each iteration. They 
generate standard multi-class classification problems, whereas we generate cost-sensitive 
problems. Bagnell, Kakade, Ng, and Schneider (2003) introduced a closely related algo- 
rithm for learning non-stationary policies in reinforcement learning. For a specified horizon 
time h, their approach learns a sequence of h policies. At each iteration, all policies are 
held fixed except for one, which is optimized by forming a classification problem via policy 
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rollout 17 . Finally, Langford and Zadrozny (2004) provide a formal reduction from rein- 
forcement learning to classification, showing that e-accurate classification learning implies 
near-optimal reinforcement learning. This approach uses an optimistic variant of sparse 
sampling to generate h classification problems, one for each horizon time step. 

9. Summary and Future Work 

We introduced a new variant of API that learns policies directly, without representing 
approximate value functions. This allowed us to utilize a relational policy language for 
learning compact policy representations. We also introduced a new API bootstrapping 
technique for goal-based planning domains. Our experiments show that the LRW-API 
algorithm, which combines these techniques, is able to find good policies for a variety of 
relational MDPs corresponding to classical planning domains and their stochastic variants. 
We know of no previous MDP technique that has been successfully applied to problems 
such as these. 

Our experiments also pointed to a number of weaknesses of our current approach. First, 
our bootstrapping technique, based on long random walks, does not always correspond 
well to the problem distribution of interest. Investigating other automatic bootstrapping 
techniques is an interesting direction, related to the general problems of exploration and 
reward shaping in reinforcement learning. Second, we have seen that limitations of our 
current policy language and learner are partly responsible for some of the failures of our 
system. In such cases, we must either: 1) depend on the human to provide useful features 
to the system, or 2) extend the policy language and develop more advanced learning tech- 
niques. Policy-language extensions that we are considering include various extensions to the 
knowledge representation used to represent sets of objects in the domain (in particular, for 
route- finding in maps/grids), as well as non-reactive policies that incorporate search into 
decision-making. 

As we consider ever more complex planning domains, it is inevitable that our brute- force 
enumeration approach to learning policies from trajectories will not scale. Presently our 
policy learner, as well as the entire API technique, makes no attempt to use the definition 
of a domain when one is available. We believe that developing a learner that can exploit 
this information to bias its search for good policies is an important direction of future work. 
Recently, Gretton and Thiebaux (2004) have taken a step in this direction by using logical 
regression (based on a domain model) to generate candidate rules for the learner. Devel- 
oping tractable variations of this approach is a promising research direction. In addition, 
exploring other ways of incorporating a domain model into our approach and other model- 
blind approaches is critical. Ultimately, scalable AI planning systems will need to combine 
experience with stronger forms of explicit reasoning. 



17. Here the initial state distribution is dictated by the policies at previous time steps, which are held fixed. 
Likewise the actions selected along the rollout trajectories are dictated by policies at future time steps, 
which are also held fixed. 
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Appendix A. Omitted Proofs 



Proposition 1. Let H be a finite class of deterministic policies. For any tt G H , and any 

set of n = e In Lj 1 trajectories drawn independently from D\, there is a 1 — 5 probability 
that every tt G H consistent with the trajectories satisfies V(tt) > V(tt) — 2V max (e + l h )- 

Proof: We first introduce some basic properties and notation that will be used below. For 
any deterministic policy tt, if tt is consistent with a trajectory t, then D%(t) is entirely 
determined by the underlying MDP transition dynamics. This implies that if two deter- 
ministic policies tt and tt' are both consistent with a trajectory t then D^(t) = D^(t). We 
will denote by v(t) the cumulative discounted reward accumulated by executing trajectory 
t. For any policy tt, we have that Vh(ir) = J2tD^{t) ■ v(t) where the summation is taken 
over all length h trajectories (or simply those that are consistent with tt). Finally for a set 
of trajectories T we will let DJ^T) = J2ter' DJ t (t) giving the cumulative probability of tt 
generating the trajectories in T. 

Consider a particular tt G H and any tt G H that is consistent with the n trajectories of 
tt. We will let T denote the set of all length h trajectories that are consistent with tt and 
f denote the set of trajectories that are consistent with tt. Following Khardon (1999b) we 
first give a standard argument showing that with high probability D^(T) > 1 — e. To see 
this consider the probability that tt is consistent with all n = e In ^j 1 trajectories of tt 
given that -D^(r) < 1 — e. The probability that this occurs is at most (1 — e) n < e~ en = jj^. 
Thus the probability of choosing such a tt is at most |#|rjy| = Thus, with probability at 
least 1 - S we know that £>£(f ) > 1 - e. Note that D£(f ) = D% (r). 

Now given the condition that -D^(r) > 1 — e we show that V ^(vr) > Vhirr) — 2eV max by 
considering the difference of the two value functions. 



v h (TT)-v h (Tt) = j2 D ut)-v(t)-j2Di(t)-v(t) 

= £ Dl(t).v(t)+ £ (DUt) - Dl(t)) ■ v(t) - ]T Di(t)-v(t) 
ter-f ternf tef-r 

= £ DUt)-v(t)+0- £ Dl(t).v(t) 
ter-f tef-r 

< WK(r-f) + #-r)] 

= V max [l-Dl(T) + l-Di(T)] 

< 2eV max 
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The third lines follows since D^(t) = D^{t) when it and tt are both consistent with t. The 
last line follows by substituting our assumption of D^(T) = D^(T) > 1 — e into the previous 
line. Combining this result with the approximation due to using a finite horizon, 

F(vr) - F(tt) < V h (7t) - V h (7r) + 2 7 A y max 

we get that with probability at least 1 — <5, V(ir) — V(tt) < 2V max (e + j h ), which completes 
the proof. □ 

Proposition 2. For any MDP with Q-advantage at least A* , and any < 5' < 1, if we 

have 

A* 

h > los 



w > 
A = 



° ^ max 
8V max \\ \A\ 

A* 
~2~ 



then for any state s, A(A, s) = A n (s) with probability at least 1 — 5'. 

Proof: Given a real valued random variable X bounded in absolute value by X max and 
an average X of w independently drawn samples of X, the additive Chernoff bound states 

that with probability at least 1 - 5, \E[X] - X\ < X mi /_ln<5 



Note that Q^(s,a) is the expectation of the random variable X(s,a) = R(s,a) + 
jV^_ 1 (T(s, a)) and Q(s,a) is simply an average of w independent samples of X(s,a). 
The Chernoff bound tells us that with probability at least 1 — \Q%(s, a) — Q(s, a)\ < 

^max\Z^ ^|7 ln<5 ' w here \A\ is the number of actions. Substituting in our choice of w we 
get that with probability at least 1 — 5', \Q%(s, a) — Q(s, a)\ < ^ is satisfied by all actions 
simultaneously. We also know that \Q w (s, a) — Q^(s,a)\ < j h V max , which by our choice 
of h gives, |Q 7r (s,a) — QJ l (s,a)\ < Combining these relationships we get that with 

probability at least 1 — 5', \Q n (s,a) — Q(s,a)\ < holds for all actions simultaneously. 

We can use this bound to show that with high probability the Q-value estimates for 
actions in A^^s) will be within a ^- range of each other, and other actions will be outside 
of that range. In particular, consider any action a € A n {s) and some other action a! . If 
a' £ A w (s) then we have that Q w (s,a) = Q w (s,a'). From the above bound we get that 
\Q(s,a) — Q(s,a')\ < Otherwise a' A 7T (s) and by our assumption about the MDP 
Q-advantage we get that Q n (s,a) — Q 7T (s,a') > A*. Using the above bound this implies 
that Q(s,a) — Q(s,a') > These relationships and the definition of A(A,s) imply that 
with probability at least 1 — 5' we have that A(A, s) = A K (s). □ 



Appendix B. Learned Policies 

Below we give the final taxonomic-decision-list policies that were learned for each domain 
in our experiments. Rather than write rules in the form a(x±, . . . , Xk) ■ Li A Li A • • • A L rn 
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we drop the variables from the head and simply write, a : L\ A L2 A • • • A L m . In addition 
below we use the notation R~* as short-hand for (it! -1 )* where R is a relation. When in- 
terpreting the policies, it is important to remember that for each rule of action type a, the 
preconditions for action type a are implicitly included in the constraints. Thus, the rules 
will often allow actions that are not legal, but those actions will never be considered by the 
system. 

Gripper 

1. MOVE: (Xx £ (NOT (GAT (CARRY -1 GRIPPER)))) A (X 2 £ (NOT (GAT (AT -1 AT-ROBBY)))) A (X 2 £ (GAT (NOT 
(CAT" 1 ROOM)))) A (Xi £ (CAT BALL)) 

2. DROP: (X-i £ (GAT -1 AT-ROBBY)) 

3. PICK: (Xi £ (GAT -1 (GAT (CARRY -1 GRIPPER)))) A (X ± £ (GAT -1 (NOT AT-ROBBY))) 

4. PICK: (X 2 £ (AT (NOT (GAT -1 ROOM)))) A (Xi £ (GAT -1 (NOT AT-ROBBY))) 

5. PICK: (Xi £ (GAT -1 (NOT AT-ROBBY))) 
Briefcase 

1. PUT-IN: (Xi £ (GAT -1 (NOT IS-AT))) 

2. MOVE: (X 2 £ (AT (NOT (CAT -1 LOCATION)))) A (X 2 £ (NOT (AT (GAT -1 CIS-AT)))) 

3. MOVE: (X 2 £ (GAT IN)) A (X 1 £ (NOT (CAT IN))) 

4. TAKE-OUT: {X x £ (CAT -1 IS-AT)) 

5. MOVE: (X 2 £ GIS-AT) 

6. MOVE: (X 2 £ (AT (GAT -1 CIS-AT))) 

7. PUT-IN: (Xi £ UNIVERSAL) 
Schedule 

1. DO-IMMERSION-PAINT: (Xi £ (NOT (PAINTED -1 X 2 ))) A (Xi £ (GPAINTED — 1 X 2 )) 

2. DO-DRILL-PRESS: (Xi £ (GHAS-HOLEO -1 X 3 )) A (Xi £ (GHAS-HOLEW - 1 X 2 )) 

3. DO-LATHE: (Xi £ (NOT (SHAPE -1 CYLINDRICAL))) A (Xi £ (GSHAPE -1 CYLINDRICAL)) 

4. DO-DRILL-PRESS: (Xi £ (GHAS-HOLEW - 1 X 2 )) 

5. DO-DRILL-PRESS: (Xi £ (GHAS-HOLEO -1 X 3 )) 

6. DO-GRIND: (Xi £ (NOT (SURFACE-CONDITION -1 SMOOTH))) A (Xi £ (GSURFACE-CONDITION -1 SMOOTH)) 

7. DO-POLISH: (Xi 6 (NOT (SURFACE-CONDITION -1 POLISHED))) A (Xi £ (GSURFACE-CONDITION -1 POLISHED)) 
S. DO-TIME-STEP: 

Elevator 

1. DEPART: (X 2 £ GSERVED) 

2. DOWN: (X 2 £ (DESTIN BOARDED)) A (X 2 £ (DESTIN GSERVED)) 

3. UP: (X 2 e (DESTIN BOARDED)) A (X 2 £ (DESTIN GSERVED)) A (X 2 £ (ABOVE (ORIGIN BOARDED))) A (X! (NOT 
(DESTIN BOARDED))) 

4. BOARD: (X 2 £ (NOT CSERVED)) A (X 2 £ GSERVED) 

5. UP: (X 2 6 (ORIGIN GSERVED)) A (X 2 £ (NOT (DESTIN BOARDED))) A (X 2 £ (NOT (DESTIN GSERVED))) A (X 2 6 
(ORIGIN (NOT CSERVED))) A (X 2 £ (ABOVE (DESTIN PASSENGER))) A (X x £ (NOT (DESTIN BOARDED))) 

6. DOWN: (X 2 £ (ORIGIN GSERVED)) A (X 2 £ (ORIGIN (NOT CSERVED))) A (Xx £ (NOT (DESTIN BOARDED))) 
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7. UP: (X 2 £ (NOT (ORIGIN BOARDED))) A (X 2 £ (NOT (DESTIN BOARDED))) 
FreeCell 

1. SENDTOHOME: (X x (CANSTACK -1 (CANSTACK (SUIT -1 (SUIT INCELL))))) A (X 5 £ (NOT GHOME)) 

2. MOVE-B: (X 2 £ (NOT (CANSTACK (ON GHOME)))) A (X 2 £ (CANSTACK GHOME)) A (X 2 £ (VALUE -1 (NOT 
COLSPACE))) A (Xi £ (CANSTACK -1 (SUIT -1 (SUIT BOTTOMCOL)))) 

3. MOVE: (Xx £ (CANSTACK -1 (ON (CANSTACK -1 (ON -1 GHOME))))) A (X 3 £ (CANSTACK (ON (SUIT -1 (SUIT BOT- 
TOMCOL))))) A (Xi £ (ON -1 BOTTOMCOL)) A (Xi £ (CANSTACK -1 (ON GHOME))) A (X 3 £ (ON -1 (CANSTACK -1 
(ON -1 (NOT (CANSTACK (VALUE -1 CELLSPACE))))))) A (X x £ (NOT (CANSTACK -1 (SUIT -1 (SUIT INCELL))))) A 
(X 3 £ (CANSTACK BOTTOMCOL)) A (X ± £ (SUIT -1 (SUIT (ON -1 (NOT (CANSTACK (VALUE -1 CELLSPACE))))))) 
A (Xi S (VALUE -1 (NOT COLSPACE))) (0 (ON -1 (NOT (CANSTACK -1 (SUIT -1 (SUIT INCELL)))))) A (X ± £ (NOT 
(CANSTACK -1 CHOME))) 

4. SENDTOHOME-B: (X 4 £ (NOT GHOME)) 

5. SENDTOHOME: (Xi £ (ON -1 (CANSTACK (CANSTACK -1 (SUIT -1 (SUIT INCELL)))))) A (X 5 £ (NOT GHOME)) 

6. SENDTOHOME: (Xi (ON - 1 (ON -1 GHOME))) A (Xi £ (CANSTACK -1 (NOT GHOME))) A {X x £ (CANSTACK -1 (NOT 
(ON -1 GHOME)))) (X 5 £ (NOT GHOME)) 

7. MOVE-B: (X! £ (NOT (CANSTACK -1 GHOME))) A (X 2 £ (VALUE -1 (NOT COLSPACE))) A {X x £ (CANSTACK -1 
(SUIT -1 (SUIT BOTTOMCOL)))) 

8. SENDTOFREE: (X x £ (ON - 1 (ON -1 GHOME))) A {X x £ (NOT GHOME)) 

9. SENDTOHOME: (X 5 £ (CANSTACK -1 (CANSTACK (ON GHOME)))) A (X 5 £ (NOT GHOME)) 

10. SENDTOHOME: (0 GHOME) (X 5 £ (VALUE -1 (NOT COLSPACE))) A (X 5 £ (NOT (CANSTACK -1 (ON -1 (NOT 
GHOME))))) A (Xi e (ON -1 (NOT (ON -1 GHOME)))) A (X 5 £ (NOT GHOME)) 

11. NEWCOLFROMFREECELL: (X x £ GHOME) 

12. SENDTOHOME: (X 5 £ (CANSTACK -1 (ON GHOME))) A (Xi £ GHOME) A (X 5 £ (NOT GHOME)) 

13. MOVE-B: (Xi £ (VALUE -1 (VALUE HOME))) A (X 2 £ (VALUE -1 (NOT COLSPACE))) A (Xi £ (CANSTACK -1 (SUIT -1 
(SUIT BOTTOMCOL)))) 

14. SENDTOHOME: (X x £ (CANSTACK -1 (ON -1 (CANSTACK -1 (SUIT -1 (SUIT INCELL)))))) A (X 5 £ (NOT GHOME)) 

15. SENDTOHOME: (X x 6 (ON -1 (ON -1 (CANSTACK -1 (ON - 1 (NOT GHOME)))))) (X 5 £ (NOT GHOME)) 

16. SENDTOFREE: (Xi £ (CANSTACK -1 (ON (ON -1 GHOME)))) A (Xi £ (SUIT -1 (SUIT BOTTOMCOL))) A (Xi 6 (ON -1 
BOTTOMCOL)) 

17. MOVE: (X 3 £ (ON -1 (CANSTACK -1 CLEAR))) A (Xi £ (ON -1 (CANSTACK (ON -1 (NOT (CANSTACK (VALUE -1 
CELLSPACE))))))) A (X 3 £ (NOT GHOME)) A (X x 6 GHOME) A (X 3 £ (CANSTACK BOTTOMCOL)) A (X 3 £ (ON -1 
(CANSTACK -1 (ON -1 (NOT (CANSTACK (VALUE -1 CELLSPACE))))))) A (Xi £ (NOT (CANSTACK -1 (SUIT -1 
(SUIT INCELL))))) A (Xi £ (ON -1 BOTTOMCOL)) A (Xi £ (SUIT -1 (SUIT (ON -1 (NOT (CANSTACK (VALUE -1 
CELLSPACE))))))) A (Xi £ (VALUE -1 (NOT COLSPACE))) A (Xi 6 (ON -1 (NOT (CANSTACK -1 (SUIT -1 (SUIT 
INCELL)))))) A (X x £ (NOT (CANSTACK -1 CHOME))) 

18. MOVE: (X x £ (SUIT -1 (SUIT CHOME))) A (X 3 £ (NOT GHOME)) A (X 3 £ (NOT (ON -1 GHOME))) A (X t £ (ON -1 
(CANSTACK -1 BOTTOMCOL))) 

19. SENDTOHOME: (X x £ (CANSTACK (ON (CANSTACK (ON GHOME))))) A (Xi £ GHOME) (X 5 £ (NOT GHOME)) 

20. SENDTOHOME: (X x £ (CANSTACK -1 (ON (CANSTACK -1 (ON -1 GHOME))))) A (Xi £ (NOT (SUIT -1 (SUIT BOT- 
TOMCOL)))) A (X 5 6 (NOT GHOME)) 

21. SENDTOFREE: (X x £ (CANSTACK (ON (CANSTACK (VALUE -1 CELLSPACE))))) A (Xi £ (CANSTACK CHOME)) 

22. SENDTOHOME: (Xi £ (CANSTACK -1 (SUIT -1 (SUIT INCELL)))) A (X x £ (ON -1 (NOT (CANSTACK (VALUE -1 
CELLSPACE))))) A (X 5 £ (NOT GHOME)) 

23. SENDTONEWCOL: (X x £ (CANSTACK (CANSTACK -1 (ON -1 GHOME)))) 

24. SENDTOFREE: (X ± £ (CANSTACK (ON -1 (CANSTACK -1 (ON -1 GHOME))))) A (X x £ (NOT (CANSTACK GHOME))) 
A (Xi £ (NOT (ON -1 GHOME))) A (Xi £ (ON -1 (NOT (CANSTACK -1 (SUIT -1 (SUIT INCELL)))))) 



111 



Fern, Yoon, & Givan 



25. SENDTOFREE: (Xi £ (ON - 1 (CANSTACK (CANSTACK -1 (ON" 1 GHOME))))) A (X 1 £ (NOT (CANSTACK BOTTOM- 
COL))) A (X! £ (NOT (CANSTACK" 1 (CANSTACK (ON GHOME))))) 

26. SENDTOFREE: (Xi £ (CANSTACK (ON" 1 (CANSTACK -1 (ON" 1 (NOT GHOME)))))) A (Xi £ (NOT (CANSTACK 
GHOME))) A (Xi 6 (CANSTACK (NOT (SUIT" 1 (SUIT BOTTOMCOL))))) 

27. SENDTOHOME: (Xi £ (CANSTACK -1 (CANSTACK (ON" 1 GHOME)))) A (Xi £ (ON" 1 (CANSTACK" 1 (ON" 1 (NOT 
GHOME))))) A (Xi 6 (NOT GHOME)) A (X 5 £ (NOT GHOME)) 

28. SENDTOFREE: (Xi e (CANSTACK (ON" 1 (CANSTACK" 1 (ON" 1 (NOT GHOME)))))) A (X 1 £ (CANSTACK (CANSTACK -1 
(ON" 1 GHOME)))) A (Xi £ (NOT GHOME)) A (Xi £ (ON -1 (CANSTACK -1 (ON" 1 (NOT (CANSTACK (VALUE -1 
CELLSPACE))))))) 

29. SENDTOFREE: (Xi £ (CANSTACK CHOME)) A (Xi £ (SUIT -1 (SUIT (CANSTACK" 1 (ON" 1 GHOME))))) 

30. SENDTOHOME: (X x £ GHOME) A (X x £ (SUIT" 1 (SUIT BOTTOMCOL))) A (X x £ (CANSTACK -1 (NOT (ON -1 
GHOME)))) A (X 5 £ (NOT GHOME)) 

31. SENDTOFREE: (Xi £ (CANSTACK" 1 (ON" 1 GHOME))) A (Xi £ (CANSTACK -1 (ON -1 (NOT GHOME)))) 

32. SENDTOFREE: (X x £ (CANSTACK (ON -1 GHOME))) A (X x £ (NOT GHOME)) A (X x £ (ON" 1 (CANSTACK -1 (ON -1 
(NOT GHOME))))) 

33. SENDTOHOME: (X x £ (ON -1 (CANSTACK -1 BOTTOMCOL))) A (Xi £ (CANSTACK -1 (NOT GHOME))) A (X s e 
(NOT GHOME)) 

34. SENDTOFREE: (Xi £ (CANSTACK (ON (CANSTACK -1 (ON -1 (NOT GHOME)))))) A (Xi £ (NOT (SUIT -1 (SUIT 
BOTTOMCOL)))) A (X t £ (NOT GHOME)) 

35. SENDTOHOME: (Xi £ (NOT (CANSTACK -1 GHOME))) A {X x £ (NOT (SUIT -1 (SUIT BOTTOMCOL)))) A (X 5 £ 
(NOT GHOME)) 

36. SENDTOFREE: (X x £ (NOT (ON -1 GHOME))) A {X x £ (CANSTACK (CANSTACK -1 (ON -1 (NOT GHOME))))) 

37. SENDTOFREE-B: (X x £ (NOT GHOME)) 

38. SENDTOFREE: (X x £ UNIVERSAL) 
Logistics 

1. FLY-AIRPLANE: (Xi £ (IN (GAT -1 AIRPORT))) A (Xi 6 (NOT (IN (GAT -1 (AT AIRPLANE))))) A (X 3 £ (NOT (GAT 
(IN -1 TRUCK)))) A (Xx £ (NOT (IN (GAT -1 (NOT AIRPORT))))) 

2. LOAD-TRUCK: (X 2 £ (IN (NOT (GAT -1 (NOT AIRPORT))))) A (Xi e (GAT -1 (GAT (IN -1 TRUCK)))) A (Xi 6 (NOT 
(CAT -1 LOCATION))) 

3. DRIVE-TRUCK: (X 3 £ (AT (AT -1 (GAT (IN -1 TRUCK))))) A (X 3 e (IN-CITY -1 (IN-CITY (AT AIRPLANE)))) A (Xi 
£ (AT -1 (NOT (GAT (IN -1 TRUCK))))) 

4. UNLOAD-TRUCK: {X x £ (GAT -1 (AT (IN OBJ)))) A (Xi £ (GAT -1 (AT OBJ))) A (X ± £ (NOT (GAT -1 (AT AIR- 
PLANE)))) A (X 2 £ (AT -1 (GAT (IN -1 TRUCK)))) A (Xi £ (GAT -1 (AT TRUCK))) 

5. FLY-AIRPLANE: (X 3 £ (GAT (IN -1 AIRPLANE))) A {X x £ (IN (NOT (GAT -1 (AT TRUCK))))) A (X x e (AT -1 (NOT 
(GAT (IN -1 TRUCK))))) 

6. UNLOAD- AIRPLANE: (X 2 £ (NOT (IN (GAT -1 (NOT AIRPORT))))) A (Xx £ (GAT -1 (AT AIRPLANE))) 

7. LOAD-TRUCK: (X 2 £ (IN (NOT (GAT -1 LOCATION)))) A (Xi £ (NOT (GAT -1 (AT TRUCK)))) A (Xi £ (GAT -1 
LOCATION)) 

8. UNLOAD-TRUCK: (X ± £ (GAT -1 (AT TRUCK))) A (X 2 £ (AT -1 AIRPORT)) A (X 2 £ (NOT (IN (GAT -1 (NOT AIR- 
PORT))))) A (Xi £ (GAT -1 (AT AIRPLANE))) 

9. FLY-AIRPLANE: (X 3 £ (AT (AT -1 (GAT (IN -1 TRUCK))))) A (Xi 6 (AT -1 (GAT (GAT -1 LOCATION)))) A (X x £ 
(NOT (AT -1 (CAT OBJ)))) 

10. DRIVE-TRUCK: (X x £ (IN (GAT -1 LOCATION))) A (X ± £ (AT -1 (NOT (GAT (IN -1 TRUCK))))) A {X x £ (AT -1 (NOT 
(AT AIRPLANE)))) 

11. UNLOAD-TRUCK: (X 2 £ (AT -1 (GAT (GAT -1 (NOT AIRPORT))))) A (X x £ (NOT (GAT -1 AIRPORT))) 

12. FLY-AIRPLANE: (X 3 6 (NOT (GAT (GAT -1 LOCATION)))) A (X x £ (AT -1 (GAT (AT -1 (CAT OBJ))))) A (X 3 £ (AT 
(NOT (GAT -1 (AT AIRPLANE))))) A (X 3 £ (AT OBJ)) A (X 2 6 (NOT (IN (GAT -1 AIRPORT)))) A (X 3 £ (NOT (AT 
(IN OBJ)))) 
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13. UNLOAD-TRUCK: (X 1 £ (GAT -1 AIRPORT)) 

14. LOAD-TRUCK: (X 1 £ (AT -1 (CAT (GAT -1 (AT AIRPLANE))))) A (Xi £ (NOT (GAT -1 LOCATION))) 

15. LOAD-TRUCK: (X ± £ (GAT -1 (CAT (GAT -1 (AT AIRPLANE))))) A (X 1 £ (NOT (GAT -1 (AT TRUCK)))) A (X ± £ 
(GAT -1 (AT (GAT -1 (AT AIRPLANE))))) 

16. LOAD-TRUCK: (Xi e (GAT -1 (NOT AIRPORT))) A (Xi e (NOT (GAT -1 (AT TRUCK)))) 

17. FLY-AIRPLANE: (X 3 £ (AT (GAT -1 (AT AIRPLANE)))) A (X 1 £ (AT -1 (CAT OBJ))) 

18. FLY-AIRPLANE: (X 3 £ (NOT (GAT (AT -1 (CAT OBJ))))) A (X 1 £ (AT -1 (GAT (AT -1 (CAT OBJ))))) A (Xi £ (AT -1 
(GAT (GAT -1 (AT TRUCK))))) 

19. LOAD-TRUCK: (Xi £ (GAT -1 (AT AIRPLANE))) A (Xi £ (NOT (GAT -1 (AT TRUCK)))) A (Xi £ (AT -1 (CAT OBJ))) 

20. LOAD-AIRPLANE: (Xi £ (GAT -1 AIRPORT)) A (Xi £ (NOT (CAT -1 LOCATION))) A (Xi £ (GAT -1 (NOT (AT 
AIRPLANE)))) A (X 2 £ (NOT (IN (GAT -1 (NOT AIRPORT))))) 

21. FLY-AIRPLANE: (X 3 £ (AT (GAT -1 (AT AIRPLANE)))) A (X 3 6 (NOT (AT TRUCK))) 

22. LOAD-TRUCK: (Xi 6 (AT -1 (CAT (GAT -1 (NOT AIRPORT))))) A (Xi £ (GAT -1 AIRPORT)) 

23. DRIVE-TRUCK: (X 3 £ (NOT (AT OBJ))) A (Xi £ (NOT (AT -1 (CAT OBJ)))) A (Xi G (AT -1 (GAT (GAT -1 LOCA- 
TION)))) 

24. LOAD-TRUCK: (X 1 £ (GAT -1 (CAT (CAT -1 AIRPORT)))) A (Xi £ (NOT (CAT -1 LOCATION))) 

25. FLY-AIRPLANE: (X 3 £ (AT (GAT -1 (AT AIRPLANE)))) A (X 1 £ (AT -1 (AT OBJ))) 

26. DRIVE-TRUCK: (Xi £ (IN OBJ)) 

27. DRIVE-TRUCK: (Xi £ (AT -1 (GAT (GAT -1 AIRPORT)))) A (X 3 £ (AT (GAT -1 AIRPORT))) A (Xi £ (AT -1 (NOT 
(AT AIRPLANE)))) 

28. FLY-AIRPLANE: (X 3 £ (CAT (GAT -1 (AT TRUCK)))) A (Xi £ (AT -1 (GAT (GAT -1 LOCATION)))) 

29. LOAD-TRUCK: (Xi £ (GAT -1 (AT OBJ))) A (Xi e (NOT (CAT -1 LOCATION))) 

30. DRIVE-TRUCK: (X 3 £ (AT (GAT -1 (AT AIRPLANE)))) A (X x £ (NOT (AT -1 (CAT OBJ)))) 

31. DRIVE-TRUCK: (X 3 £ (AT AIRPLANE)) A (X 3 £ (AT (GAT -1 (AT TRUCK)))) 

32. UNLOAD- AIRPLANE: (X 2 £ (NOT (AT -1 (CAT OBJ)))) A (Xi £ (GAT -1 (NOT AIRPORT))) 

33. DRIVE-TRUCK: (X 3 £ (AT (GAT -1 (AT TRUCK)))) 

34. LOAD-TRUCK: (Xi £ (AT -1 (NOT AIRPORT))) A (X ± £ (GAT -1 AIRPORT)) 

35. FLY-AIRPLANE: (X 3 £ (AT (GAT -1 LOCATION))) 

36. FLY-AIRPLANE: (Xi £ (IN OBJ)) A (X 3 £ (NOT (GAT (GAT -1 LOCATION)))) A (X x £ (NOT (IN (GAT -1 AIRPORT)))) 
A (X 3 £ (NOT (AT (IN OBJ)))) A (X x £ (AT -1 (GAT (AT -1 (CAT OBJ)))))) 

37. DRIVE-TRUCK: (Xi £ (AT -1 (AT AIRPLANE))) 

38. LOAD- AIRPLANE: (X ± £ (GAT -1 (NOT AIRPORT))) 
Blocks World 

1. STACK: (X 2 £ (GON HOLDING)) A (X 2 £ (CON - * (MIN GON))) A (Xj £ (GON - * ON- TABLE)) 

2. PUTDOWN: 

3. UNSTACK: (X t £ (ON - * (ON (MIN GON)))) A (X 2 £ (CON - * (ON* (MIN GON)))) 

4. UNSTACK: (X 2 £ (ON -1 (GON CLEAR))) A (X 2 £ (GON* (ON~* (MIN GON)))) A (X ± £ (ON - * (GON ON-TABLE))) 
(Xi 6 (GON - * (NOT CLEAR))) 

5. PICKUP: (Xi 6 (GON -1 (CON - * (MIN GON)))) A (X x £ (GON -1 CLEAR)) (Xi e (GON -1 (CON - * ON-TABLE))) 

6. UNSTACK: (X 2 £ (CON - * (GON -1 CLEAR))) A (X x £ (GON -1 (ON - * (MIN GON)))) A (X t £ (GON -1 (CON* 
CLEAR))) 
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7. UNSTACK: (X! £ (NOT (GON - * (MIN GON)))) 

8. UNSTACK: (X 2 g (GON ON-TABLE)) A (Xi g (GON -1 (CON - * (MIN GON)))) A (Xi g (GON -1 CLEAR)) 

9. UNSTACK: (Xi g (NOT (CON - * (MIN GON)))) A (X 2 6 (ON - * (GON -1 ON- TABLE))) A (X 2 £ (GON - * (NOT ON- 
TABLE))) A (Xi g (GON* (GON - * ON-TABLE))) (X 1 g (GON - * (NOT CLEAR))) 

10. UNSTACK: (X 2 6 (NOT (CON CLEAR))) A (Xi g (GON -1 (CON - * ON-TABLE))) 

11. UNSTACK: (Xi g (GON -1 CLEAR)) A (Xi g (ON - * (ON (MIN GON))) 
Ground Logistics 

1. LOAD: (X 2 g (NOT (IN (GIN -1 CITY)))) A (Xj g (NOT (CIN -1 CITY))) A (X x g (GIN -1 CITY)) 

2. UNLOAD: (Xi 6 (GIN -1 X 3 )) 

3. DRIVE: (Xi g (IN (GIN -1 X 3 ))) 

4. DRIVE: (X 3 g (NOT (GIN BLOCK))) A (X 3 g (IN (GIN -1 CITY))) A (Xi 6 CAR) (X 2 g CLEAR) 

5. DRIVE: (X 3 g (IN (GIN -1 RAIN))) A (X x 6 TRUCK) 
Colored Blocks World 

1. PICK-UP-BLOCK-FROM: (X 2 g (NOT (CON-TOP-OF - * TABLE))) A (X 2 g (GON-TOP-OF -1 (ON-TOP-OF BLOCK))) 

2. PUT-DOWN-BLOCK-ON: (X 2 g (CON-TOP-OF -1 (CON-TOP-OF -1 BLOCK))) A (X 2 g (GON-TOP-OF HOLDING)) A 
(X 2 g (CON-TOP-OF - * TABLE)) 

3. PICK-UP-BLOCK-FROM: (X 2 g (NOT (CON-TOP-OF BLOCK))) A (X x g (ON-TOP-OF - * (GON-TOP-OF -1 TABLE))) 
A (X 2 g (GON-TOP-OF* (GON-TOP-OF -1 BLOCK))) A (X 2 g (NOT (CON-TOP-OF -1 BLOCK))) A (X 2 g (ON-TOP- 
OF -1 (GON-TOP-OF BLOCK))) A (Xi g (GON-TOP-OF* (GON-TOP-OF -1 BLOCK))) 

4. PICK-UP-BLOCK-FROM: (Xi g (NOT (CON-TOP-OF - * TABLE))) A (Xi £ (GON-TOP-OF -1 (CON-TOP-OF - * TA- 
BLE))) A (Xi g (GON-TOP-OF - * (ON-TOP-OF -1 BLOCK))) 

5. PUT-DOWN-BLOCK-ON: (X 2 6 (CON-TOP-OF -1 (ON-TOP-OF -1 TABLE))) A (X 2 g (GON-TOP-OF HOLDING)) A (X 2 
g (CON-TOP-OF - * TABLE)) 

6. PUT-DOWN-BLOCK-ON: (X 2 6 (CON-TOP-OF (ON-TOP-OF BLOCK))) A (X 1 g (GON-TOP-OF -1 (GON-TOP-OF -1 
BLOCK))) 

7. PUT-DOWN-BLOCK-ON: (X 2 g (GON-TOP-OF HOLDING)) A (X 2 g (CON-TOP-OF - * TABLE)) 

8. PUT-DOWN-BLOCK-ON: (X 2 g TABLE) 

9. PICK-UP-BLOCK-FROM: (X 2 g (NOT (CON-TOP-OF - * TABLE))) A (X 2 £ (GON-TOP-OF -1 (CON-TOP-OF - * TA- 
BLE))) 

10. PICK-UP-BLOCK-FROM: (X 1 g (GON-TOP-OF -1 (CON-TOP-OF -1 TABLE))) A (X 2 6 TABLE) A {X x £ (GON-TOP-OF 
(GON-TOP-OF BLOCK))) A (Xi £ (GON-TOP-OF (ON-TOP-OF -1 TABLE))) 

11. PICK-UP-BLOCK-FROM: (X 2 £ (ON-TOP-OF (CON-TOP-OF BLOCK))) A (X x £ (GON-TOP-OF -1 (CON-TOP-OF -1 
TABLE))) 

12. PICK-UP-BLOCK-FROM: (X 2 £ (ON-TOP-OF -1 BLOCK)) A (X 2 £ (NOT (CON-TOP-OF - * TABLE))) A (X 2 e (GON- 
TOP-OF - * (ON-TOP-OF -1 BLOCK))) A (X 2 £ (GON-TOP-OF* (ON-TOP-OF -1 BLOCK))) 

13. PICK-UP-BLOCK-FROM: (X x £ (GON-TOP-OF -1 (GON-TOP-OF -1 TABLE))) 
Boxworld 

1. DRIVE-TRUCK: (X 2 g (GBOX-AT-CITY (BOX-AT-CITY -1 X 3 ))) A (X 3 g (NOT (CAN-FLY (TRUCK- AT-CITY (NOT 
PREVIOUS))))) A (X 3 £ (CAN-DRIVE -1 PREVIOUS)) A (X 2 6 (NOT (CAN-FLY (TRUCK- AT-CITY (NOT PREVI- 
OUS))))) A (X 3 g (NOT (CAN-FLY (BOX-AT-CITY BOX)))) A (X 2 £ (CAN-DRIVE (CAN-DRIVE (BOX-AT-CITY BOX)))) 
A (X 3 £ (NOT (CAN-FLY (TRUCK- AT-CITY (BOX-ON-TRUCK (GBOX-AT-CITY -1 CITY)))))) 

2. UNLOAD-BOX-FROM-TRUCK-IN-CITY: (X x £ (GBOX-AT-CITY -1 (TRUCK- AT-CITY PREVIOUS))) A (X 3 £ (GBOX- 
AT-CITY BOX)) A (X 3 £ (NOT (BOX-AT-CITY PREVIOUS))) A (Xi £ (GBOX-AT-CITY -1 (CAN-DRIVE -1 (CAN- 
DRIVE -1 (CAN-FLY CITY))))) A (X 2 e (BOX-ON-TRUCK (GBOX-AT-CITY -1 PREVIOUS))) 

3. DRIVE-TRUCK: (X x £ (BOX-ON-TRUCK (GBOX-AT-CITY -1 X 3 ))) A (X 2 6 (NOT (CAN-DRIVE (TRUCK- AT-CITY 
(BOX-ON-TRUCK (GBOX-AT-CITY -1 CITY)))))) 
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4. DRIVE-TRUCK: (X 3 £ (CAN-DRIVE (BOX-AT-CITY PREVIOUS))) A (X 2 £ (CAN-FLY (CAN-DRIVE -1 (BOX-AT-CITY 
BOX)))) A (X 3 £ (CAN-DRIVE (CAN-FLY (TRUCK-AT-CITY TRUCK)))) A (X 2 £ (NOT (CAN-DRIVE (TRUCK- AT-CITY 
(BOX-ON-TRUCK (GBOX- AT-CITY -1 CITY)))))) A (X 2 £ PREVIOUS) A (X 2 £ (CAN-DRIVE (CAN-DRIVE X 3 ))) A (X 3 
6 (NOT (TRUCK-AT-CITY (BOX-ON-TRUCK (GBOX- AT-CITY - 1 CITY))))) A (X 3 £ (NOT (CAN-FLY PREVIOUS))) A 
(X 3 £ (CAN-DRIVE (NOT (BOX-AT-CITY BOX)))) A (X 2 £ (CAN-DRIVE (CAN-DRIVE -1 X 3 ))) A (X 3 £ (CAN-DRIVE 
(NOT (TRUCK-AT-CITY TRUCK)))) 

5. LOAD-BOX-ON-TRUCK-IN-CITY: (Xi £ (GBOX- AT-CITY -1 (CAN-DRIVE (TRUCK-AT-CITY TRUCK)))) A (X 3 £ (NOT 
(PLANE- AT-CITY PREVIOUS))) A (X 3 £ (CAN-DRIVE (CAN-DRIVE -1 (CAN-FLY CITY)))) A (X 3 £ (CAN-DRIVE -1 
(NOT (TRUCK-AT-CITY (NOT PREVIOUS))))) 

6. UNLOAD-BOX-FROM-TRUCK-IN-CITY: (X 3 £ (GBOX- AT-CITY (BOX-ON-TRUCK -1 TRUCK))) A (X 3 £ (NOT (CAN- 
FLY (TRUCK-AT-CITY (BOX-ON-TRUCK (GBOX- AT-CITY -1 CITY)))))) A (Xi £ (GBOX- AT-CITY - 1 CITY)) 

7. DRIVE-TRUCK: (Xi £ (BOX-ON-TRUCK (GBOX- AT-CITY - 1 PREVIOUS))) A (X 3 £ (CAN-DRIVE (GBOX- AT-CITY 
(GBOX- AT-CITY -1 PREVIOUS)))) A (X 3 £ (NOT (PLANE- AT-CITY PLANE))) A (X 2 £ (NOT (CAN-FLY (GBOX-AT- 
CITY (GBOX- AT-CITY -1 PREVIOUS))))) 

8. FLY-PLANE: (X 1 £ (BOX-ON-PLANE (GBOX- AT-CITY - 1 X 3 ))) 

9. UNLOAD-BOX-FROM-PLANE-IN-CITY: (X x £ (GBOX- AT-CITY - 1 PREVIOUS)) 

10. FLY-PLANE: (X 2 £ (NOT (CAN-DRIVE (TRUCK-AT-CITY (BOX-ON-TRUCK (GBOX- AT-CITY -1 CITY)))))) A (X 2 £ 
(GBOX-AT-CITY BOX)) A (X 3 £ (NOT (PLANE- AT-CITY PREVIOUS))) A (Xi £ (NOT PREVIOUS)) 

11. LOAD-BOX-ON-PLANE-IN-CITY: (Xi £ (GBOX-AT-CITY -1 (CAN-FLY PREVIOUS))) A (X 3 £ (NOT (TRUCK-AT-CITY 
(NOT PREVIOUS)))) A (X 3 £ (NOT (CAN-DRIVE (TRUCK-AT-CITY (BOX-ON-TRUCK (GBOX-AT-CITY -1 CITY)))))) 

12. DRIVE-TRUCK: (Xi £ (BOX-ON-TRUCK (GBOX-AT-CITY -1 X 3 ))) A (X 2 £ (NOT (CAN-DRIVE (CAN-FLY PREVI- 
OUS)))) A (X 2 £ (CAN-DRIVE -1 (CAN-FLY CITY))) 

13. LOAD-BOX-ON-TRUCK-IN-CITY: (Xi £ (GBOX-AT-CITY -1 PREVIOUS)) 
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